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Abstract:  This  thesis  presents  a  systematic  study  of  deformable  image  transformations 
for  nonrigidly  aligning  a  template  to  an  image.  The  study  concentrates  on  an  information- 
theoretic  similarity  measure,  fluid  deformation,  and  prior  shape  constraints. 

The  similarity  between  a  target  image  and  a  template  is  measured  based  on  their  mu¬ 
tual  information.  Suitability  of  the  mutual  information  measure  for  non-rigid-body  image 
registration  is  systematically  investigated.  A  mutual  information  bound  is  derived,  and 
a  gradient  calculation,  which  scales  linearly  with  the  volume  size,  is  presented.  Modifica¬ 
tion  of  a  fluid  model  proposed  elsewhere  is  shown  to  retain  the  desirable  properties  of  the 
deformation  while  allowing  more  efficient  numerical  implementation.  Shape  information 
is  learned  by  performing  eigenshape  analysis  on  a  training  set  of  correct  deformations  of 
a  single  template  to  several  typical  segmentations.  The  most  likely  deformations  are  then 
promoted  according  to  the  learned  shape  information.  The  shape  modeling  technique 
does  not  require  a  prelabeling  or  ordering  of  points  in  the  training  set  and  can  handle 
multiple  shapes  simultaneously.  Based  on  these  results,  a  new  method  is  developed  for 
nonrigidly  aligning  a  template  to  a  study  image.  The  approach  is  robust  to  a  wide  variety 
of  contrast  variations  and  supports  large,  curved  geometric  variations. 

This  method  has  been  experimentally  validated  using  synthetic,  magnetic  resonance, 
and  cryosection  images.  It  is  also  incorporated  into  a  brain  image  segmentation  method 
under  development  at  the  University  of  Illinois.  Potential  applications  include  image 
segmentation,  functional  brain  mapping,  and  automatic  target  recognition. 
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ABSTRACT 


This  thesis  presents  a  systematic  study  of  deformable  image  transformations  for  non- 
rigidly  aligning  a  template  to  an  image.  The  study  concentrates  on  an  information- 
theoretic  similarity  measure,  fluid  deformation,  and  prior  shape  constraints. 

The  similarity  between  a  target  image  and  a  template  is  measured  based  on  their  mu¬ 
tual  information.  Suitability  of  the  mutual  information  measure  for  non-rigid-body  image 
registration  is  systematically  investigated.  A  mutual  information  bound  is  derived,  and 
a  gradient  calculation,  which  scales  linearly  with  the  volume  size,  is  presented.  Modifica¬ 
tion  of  a  fluid  model  proposed  elsewhere  is  shown  to  retain  the  desirable  properties  of  the 
deformation  while  allowing  more  efficient  numerical  implementation.  Shape  information 
is  learned  by  performing  eigenshape  analysis  on  a  training  set  of  correct  deformations  of  a 
single  template  to  several  typical  segmentations.  The  most  likely  deformations  are  then 
promoted  according  to  the  learned  shape  information.  The  shape  modeling  technique 
does  not  require  a  prelabeling  or  ordering  of  points  in  the  training  set  and  can  handle 
multiple  shapes  simultaneously.  Based  on  these  results,  a  new  method  is  developed  for 
nonrigidly  aligning  a  template  to  a  study  image.  The  approach  is  robust  to  a  wide  variety 
of  contrast  variations  and  supports  large,  curved  geometric  variations. 

This  method  has  been  experimentally  validated  using  synthetic,  magnetic  resonance, 
and  cryosection  images.  It  is  also  incorporated  into  a  brain  image  segmentation  method 
under  development  at  the  University  of  Illinois.  Potential  applications  include  image 
segmentation,  functional  brain  mapping,  and  automatic  target  recognition. 
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CHAPTER  1 


INTRODUCTION 


It  is  a  common  practice  in  pattern  recognition  problems  to  match  a  candidate  pattern 
to  one  of  perhaps  several  prototype  patterns.  For  all  but  the  simplest  applications, 
maintaining  a  prototype  database  with  all  possible  variations  of  an  input  pattern  is 
impractical.  Early  research  concentrated  on  allowing  several  simple  transformations  of  a 
prototype  pattern  to  alleviate  the  need  for  so  many  prototypical  patterns.  As  research 
has  matured  in  this  area,  allowable  transformations  have  become  more  sophisticated, 
even  to  include  a  large  number  of  nonrigid  deformations  [1,  2]. 

The  main  concepts  of  deformable  models  have  origins  dating  back  to  at  least  the 
early  1970s.  For  example,  the  works  of  Widrow  [3]  and  Fischler  and  Elschlager  [4] 
involved  rubber  masks  or  rigid  components  held  together  by  “springs”  for  use  in  elastically 
matching  a  model  to  an  image  being  studied.  Interest  in  deformable  models  has  increased 
in  recent  years  due  largely  to  two  factors,  (a)  Deformable  models  are  an  effective  way  to 
combine  low-level  image  information  with  higher-level  prior  information,  (b)  Increases 
in  computational  speeds  of  computers  have  enabled  the  creation  of  deformable  models 
with  sufficient  sophistication  to  begin  to  handle  practical  problems.  Figure  1.1  illustrates 
the  potential  use  of  a  deformable  matching  technique;  a  single  template  may  be  used  to 
describe  a  broader  class  of  candidates  than  would  a  rigid  technique. 

However,  the  area  of  deformable  models  is  not  mature.  Existing  limitations  prevent 
current  deformable  matching  techniques  from  solving  many  practical  computer  vision 
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Figure  1.1  An  example  of  template  matching.  The  example  fish  matches  only  three 
fish  (solid  lines)  using  rigid  transformations.  Deformable  transformations  may  allow  the 
example  to  match  other  fish  (dashed  lines).  (Adapted  from  [1].) 

problems.  For  example,  many  techniques  rely  on  absolute  intensity  information,  pre¬ 
cluding  use  in  an  unknown  contrast  environment.  Others  rely  on  intensity  gradient 
information  and  ignore  information  from  within  region  boundaries.  Such  methods  are 
sensitive  to  noise  and  initialization.  Additionally,  many  methods  are  formulated  for  sin¬ 
gle  object  deformation,  so  implementation  in  the  presence  of  multiple  contours,  objects, 
etc.,  is  problematic.  Finally,  in  most  cases,  deformations  are  based  on  prior  information 
from  only  a  small  number  of  localized  points.  Global  “shape”  information  is  largely 
ignored,  so  deformations  may  occur  which  are  not  consistent  with  prior  knowledge  of 
the  system.  This  dissertation  describes  a  template  deformation  method  which  presents 
significant  advances  in  overcoming  these  limitations. 

1.1  Motivation 

The  need  for  more  automated  image  processing  methods  is  pressing.  With  the  rapid 
advances  in  imaging  technology,  it  is  now  routine  to  acquire  large  image  data  sets  easily 
and  quickly;  in  fact,  the  ability  to  acquire  images  has  thus  far  surpassed  our  ability  to 
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efficiently  analyze  and  interpret  them.  This  is  particularly  true  in  the  area  of  medical 
imaging. 

Magnetic  resonance  imaging  (MRI),  x-ray  computed  tomography  (CT),  and  ultra¬ 
sonic  imaging  are  quite  commonplace,  while  current  clinical  segmentation  techniques 
are,  at  best,  semiautomated.  Magnetic  resonance  (MR)  images  are  routinely  used  to 
help  diagnose  disorders  of  the  brain;  however,  the  manual  segmentation  associated  with 
the  diagnosis  can  be  quite  time-consuming  and  prone  to  error,  bias,  and  nonrepeatability. 

For  example,  the  progression  of  some  illnesses,  such  as  AIDS,  can  be  determined  by 
measuring  changes  in  the  volume  of  white  matter,  gray  matter,  or  the  ratio  between  the 
two  [5].  Manual-based  measurements  are  potentially  biased  because  measurements  are 
taken  over  a  period  of  time  by  perhaps  different  experts.  An  automated  segmentation 
method  could  reduce  these  biases.  In  other  medical  applications  it  is  important  to  lo¬ 
calize  key  anatomical  regions  and  structures.  These  structures  can  then  be  used  to  help 
neuroscientists  examine  the  physiological  processes  in  functional  studies.  An  automated 
means  of  localizing  and  labeling  these  structures  would  certainly  benefit  researchers  in 
this  area.  The  same  information  would  also  be  useful  to  surgeons  to  aid  in  surgical 
planning. 

Figure  1.2  shows  several  medical  images  which  could  be  considered  to  belong  to  a 
single  class  because  they  were  acquired  from  the  “same”  anatomical  location  of  several 
human  brains.  As  can  be  seen  from  the  image,  some  portions  of  the  anatomy  have  a 
highly  regular  shape  —  the  scalp  and  skull  for  example  —  while  the  inner  structures 
(white  matter  and  gray  matter)  are  highly  irregular  and  can  exhibit  large,  curved  differ¬ 
ences.  Additionally,  depending  on  the  imaging  process,  the  average  graylevel  intensities 
of  corresponding  structures  in  the  different  images  (i.e.,  the  image  contrast)  differ  widely. 
These  issues  present  problems  for  existing  deformable  template  methods. 
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Figure  1.2  Several  2-D  brain  image  examples.  Images  are  axial  views  of  the  supraven¬ 
tricular  region  of  different  human  brains  using  different  imaging  parameters. 

1.2  Problem  Formulation 

Consider  a  fixed  image  being  studied  d(x)  with  an  underlying  structure  s(x),  where 
x  represents  the  domain  on  which  both  d  and  s  are  defined.  An  imaging  process  can  be 
modeled  as 


d(x)  =  C(s(x),</>)  +  77  (1.1) 

The  actual  intensity  values  in  d  are  determined  by  the  underlying  structure  s,  the  imaging 
function  C(-,  •),  exogenous  factors  4>,  and  noise  77.  The  exogenous  factors  are  factors  such 
as  lighting,  temperature,  etc.,  that  are  apart  from  the  actual  object  being  imaged,  but 
that  affect  the  intensities  in  the  image. 

The  template  deformation  problem  can  be  defined  as  follows:  Given  an  image  d(x) 
and  a  template  a(x)  find  a  coordinate  displacement  field  u*  from  a  set  of  possible  dis¬ 
placement  fields  U  such  that  when  it  is  applied  to  a,  a(x  —  u*(x))  is  most  closely  aligned 
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with  d.  More  concisely, 


u*  =  argmaxM(d,  a(u))  (1.2) 

uGU 

where  M  is  a  metric  which  measures  the  match  between  the  template  and  study. 

The  problem  is  often  formulated  in  terms  of  forces  acting  on  the  template.  The  force 
balance  equation  is 

£u  +  b  =  0  (1.3) 

Using  this  formulation,  the  template  deformation  is  dependent  on  an  external  body  force 
b,  which  is  a  function  of  M,  and  an  internal  model  C.  The  internal  model  maintains 
allowable  deformations  of  the  template,  and  the  body  force  drives  the  template  into 
alignment  with  the  study  image  data.  The  “correct”  deformation  is  assumed  when  the 
balance  of  forces  is  met.  Different  deformable  models  utilize  different  models  for  C  and 
b. 

A  desirable  property  of  an  internal  model  is  that  it  allow  deformations  which  are 
likely,  given  prior  knowledge  about  the  imaged  object,  while  rejecting  deformations  that 
contradict  the  prior  knowledge.  However,  as  is  illustrated  in  Figure  1.2,  not  all  image 
regions  will  exhibit  the  same  degree  of  variability.  Variations  in  the  contour  of  the  skull 
are  relatively  small  compared  to  the  size  of  the  skull,  but  the  white  matter  variations  can 
be  on  the  same  order  of  magnitude  as  the  white  matter  region  itself.  In  addition  to  the 
large  variations  present  in  the  white  matter,  many  of  the  finger-like  regions  are  curved, 
so  £  should  be  sufficiently  general  for  u  to  follow  large,  curved  (nonlinear)  trajectories. 

As  the  internal  model  becomes  general  enough  to  accommodate  large,  nonlinear  vari¬ 
ations,  the  dimensionality  of  the  deformation  space  increases,  thus  increasing  the  possi¬ 
bility  for  over-deformation.  Rigid-body  transformations  have  few  degrees  of  freedom  — 
on  the  order  of  10.  For  non-rigid-body  deformations  the  degrees  of  freedom  are  much 
higher  —  on  the  order  of  100,000  for  a  256  x  256  2-D  image.  This  greatly  increases 
both  the  complexity  of  the  search  space  and  the  likelihood  becoming  trapped  in  a  local 
extremum.  In  the  rigid-body  case  there  are  many  more  data  than  degrees  of  freedom,  so 
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redundancies  exist.  In  the  non-rigid-body  case,  the  data  and  degrees  of  freedom  are  on 
the  same  order  of  magnitude,  so  the  deformation  result  is  much  more  sensitive  to  noise 
and  other  uncertainties  in  the  data.  A  fundamental  challenge,  then,  is  to  define  a  model 
that  is  sufficiently  general  to  allow  likely  large,  nonlinear  deformations,  but  not  allow 
unacceptable  over-deformations. 

A  second  challenge  is  defining  a  body  force  independent  of  the  image  contrast.  As 
also  illustrated  in  Figure  1.2,  the  contrast  of  the  images  may  vary  within  an  image  class, 
so  body  forces  relying  on  metrics  such  as  squared  intensity  differences  are  inadequate. 
Some  methods  avoid  using  absolute  intensity  information  by  using  the  image  intensity 
gradient  instead.  As  has  been  previously  mentioned,  these  methods  ignore  information 
within  regions  and  are  sensitive  to  noise  and  initialization. 

A  third  challenge  is  deformation  in  the  presence  of  multiple  contours  or  objects.  Many 
deformable  methods  that  are  well  defined  for  a  single  object  or  contour  are  not  easily 
extendible  to  multiple  contours. 

1.3  Overview  of  the  Proposed  Method 

This  thesis  describes  a  new  template  deformation  method  with  significant  develop¬ 
ments  in  overcoming  the  challenges  just  described.  The  model  is  made  contrast-tolerant 
by  computing  a  body  force  based  on  the  mutual  information  between  the  template  and 
study  images.  A  viscous  fluid  internal  model  is  used  to  allow  large,  curved  deformations 
while  maintaining  the  topology  of  the  template.  Over-deformations  controlled  through 
an  additional  body  force  based  on  prior  shape  information.  Due  to  the  volumetric  nature 
of  the  fluid  model,  deformation  of  multiple  objects  is  inherent.  More  explicitly,  (1.3)  is 
rewritten  as 

C\\  +  bd  +  bp  =  0  (1-4) 

where  £  is  a  fluid  transformation  operator,  bd  is  a  force  based  on  the  mutual  information 
between  the  image  data  and  the  template,  and  bp  is  a  force  based  on  learned  shape 
information. 
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Template 

Figure  1.3  Schematic  of  the  proposed  deformation  procedure.  A  template  is  incremen¬ 
tally  deformed  to  match  the  image  data  and  the  prior  knowledge. 

Figure  1.3  shows  a  pictorial  schematic  of  the  method.  A  study  image  d  and  a  template 
a  are  input  into  the  system.  The  template  is  iteratively  deformed  according  to  (1.4). 

The  work  herein  concentrates  on  2-D  input  images,  although  some  3-D  data  was  used 
in  the  generation  of  experimental  results. 

1.4  Main  Contributions 

•  Contrast-independent  template  alignment.  An  in-depth  study  of  mutual  informa¬ 
tion  as  it  applies  to  nonrigid  template  deformation  is  presented.  Conditions  for 
optimality  are  defined.  A  study  of  the  bias  and  variance  of  the  mutual  information 
between  images  is  also  presented.  Limitations  on  the  metric  are  investigated. 

•  Mutual  information  bound.  A  bound  on  the  maximal  mutual  information  is  derived. 
Study  of  the  bound  helps  to  identify  limitations  of  MI  as  an  alignment  metric. 

•  Body  force  based  on  maximization  of  mutual  information.  A  formula  is  derived  by 
which  the  mutual  information  between  a  study  image  and  template  is  converted  to 
a  body  force.  The  complexity  of  the  body  force  calculation  is  linearly  proportional 
to  the  image  size. 
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•  Fluid  deformation  model.  A  fluid  model  is  developed.  The  model  maintains  de¬ 
sirable  properties  of  an  existing  fluid  model  while  being  computationally  more  ef¬ 
ficient.  The  model  is  proven  to  maintain  the  topology  of  the  template  throughout 
deformation. 

•  Prior  shape  information.  A  model  is  developed  to  capture  both  inter-  and  in¬ 
trashape  variations  among  regions  of  the  template.  The  model  is  demonstrated  to 
improve  the  deformation  result. 

•  Application  to  brain  image  segmentation.  The  developed  model  is  incorporated 
into  a  brain  image  segmentation  method  [6]  that  is  part  of  an  ongoing  project  at 
the  Beckman  Institute  for  Advanced  Science  and  Technology  on  the  University  of 
Illinois  at  Urbana-Champaign. 

1.5  Organization  of  the  Thesis 

The  next  chapter  gives  a  brief  review  of  research  related  to  the  work  herein.  Chapter 
3  summarizes  relevant  concepts  from  probability  and  information  theory.  The  theory 
and  description  of  the  proposed  deformable  model  are  found  in  Chapters  4,  5,  and  6, 
which  discuss  the  information-theoretic  body  force,  the  fluid  transform,  and  shape-based 
constraints,  respectively.  Application  to  brain  image  segmentation  is  shown  in  Chapter 
7,  and  Chapter  8  gives  concluding  remarks  and  discusses  areas  for  further  research. 


8 


CHAPTER  2 


BRIEF  LITERATURE  REVIEW 

2.1  Active  Contours 

As  mentioned  previously,  the  main  concepts  of  deformable  models  have  origins  dating 
back  to  at  least  the  early  1970s.  Work  by  Widrow  [3]  involved  rubber  masks,  and  Fischler 
and  Elschlager  [4]  used  rigid  components  held  together  by  “springs”  to  elastically  match 
a  model  to  a  test  image. 

Deformable  methods  did  not  become  widely  popular,  however,  until  Kass  et  al.  [7] 
reported  the  development  of  active  contours ,  or  snakes.  The  snake  model  consists  of 
an  elastic  contour  that  is  deformed  according  to  a  cost  function.  The  contour  is  active 
in  the  sense  that  it  can  adapt  itself  to  fit  the  image.  The  cost  function  is  the  summa¬ 
tion  of  an  image-dependent  term,  usually  based  on  the  image  intensity  gradient,  and  an 
image-independent  regularization  term.  The  cost  function  is  minimized  using  gradient 
descent.  Essentially,  an  operator  specifies  an  initial  contour  near  a  region  boundary  and 
the  method  actively  seeks  a  close  edge  within  the  image  under  regularizing  constraints 
of  tension  and  rigidity.  The  contour  is  considered  to  be  fully  deformed  when  the  regu¬ 
larization  forces  acting  on  the  contour  balance  out  the  attractive  forces  of  the  image. 

Since  its  introduction,  several  improvements  and  modifications  to  the  active  contour 
method  have  been  proposed  including  modifications  on  the  external  force  [8,  9],  internal 
force  [10],  and  the  addition  of  a  balloon  force  to  improve  performance  in  noisy  situations 
[11].  Others  have  investigated  different  optimization  strategies  [12],  use  of  snakes  in 
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analysis  of  multiobject  scenes  [13],  and  the  relationship  of  snakes  to  other  algorithms 

[14]- 

The  extension  of  snakes  to  multiobject  scenes  is  of  major  importance  to  brain  image 
segmentation  because  of  the  multiple  structures  and  tissue  classes  that  exist.  A  straight¬ 
forward  approach  to  the  multiple  surface  deformation  is  to  deform  each  surface  separately 
using  snakelike  methods  and  then  interpolate  between  the  deformed  surfaces  using  an 
elastic  warping  [15].  Because  the  initial  surfaces  are  deformed  independently,  the  spatial 
relationship  between  the  contours  is  ignored  as  they  deform. 

2.2  Implicit  Snakes 

Another  method  that  has  been  investigated  for  segmentation  of  multiple  objects  is 
implicit  snakes  or  propagating  fronts  [16,  17,  18,  19,  20,  21].  Implicit  snakes  are  based 
on  the  level-set  theory  introduced  by  Osher  and  Sethian  [22].  In  this  approach  the 
multiple  contours  (or  surfaces)  to  be  deformed  are  represented  by  a  level  set  of  a  higher¬ 
dimensional  function.  The  contours  are  then  indirectly  deformed  based  on  the  evolution 
of  the  higher-dimensional  function,  which  is  based  on  edge  attraction  similar  to  the  snake 
method.  The  implicit  snake  method  is  analogous  to  a  grass-fire  propagation.  The  flame 
front  propagates  through  a  grass  field  (the  image),  slows  when  the  image  gradient  is  high, 
and  finally  stops  at  the  edge.  Also,  once  a  particle  (pixel)  has  been  “burnt”  (crossed 
by  the  front)  it  stays  burnt,  so  the  contour  is  free  to  split  or  merge.  Use  of  implicit 
snakes  for  multiple  objects  is  limited,  though,  since  the  original  formulation  does  not 
provide  for  concentric  contours.  Extension  to  handle  concentric  contours  has  had  limited 
success.  In  [17],  Niessen  et  al.  found  tracking  multiple  level-sets  to  be  incompatible 
with  balloon  forces  used  to  overcome  insignificant  edges.  An  additional  problem  is  that 
the  implicit  snake  model  places  no  restriction  on  the  topology  of  the  structure  being 
segmented.  This  can  easily  lead  to  an  over-segmentation  of  the  image,  resulting  in  the 
need  to  merge  regions.  The  implicit  snake  model  also  inherits  some  of  the  problems  of  the 
original  snake  model:  dependency  on  only  image  edges  and  sensitivity  to  noise.  Finally, 
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Niessen  et  al.  [17]  concluded,  after  some  investigation  of  the  usefulness  of  implicit  snakes 
in  medical  imaging  applications,  that  without  user-supplied  initialization,  topological 
constraints,  and  a  priori  knowledge,  it  is  highly  unlikely  that,  for  example,  in  an  MR 
brain  image  a  gradient-dependent  metric  will  suffice  for  delineation  of  the  white  matter 
surface.  Therefore,  implicit  snake  models  do  not  seem  to  be  suited  for  some  of  the  more 
challenging  problems  in  medical  image  segmentation. 

2.3  Deformable  Templates 

An  alternative  approach  to  multiple  surface  deformation  is  template  (atlas)  deforma¬ 
tion  [23,  24,  25,  26].  Template  deformations  model  the  entire  image  volume  as  a  3-D 
lattice  such  that  each  element  in  the  lattice  is  connected  to  each  of  its  neighbors.  A 
force  on  any  piece  of  the  volume  is  distributed  to  other  elements  in  the  volume  through 
the  neighbor-to-neighbor  connections.  Deformable  template  methods  have  the  advantage 
over  snakelike  methods  in  that  spatial  relationships  among  volume  elements  are  calcu¬ 
lated  automatically  through  the  interconnections  of  the  lattice.  A  main  advantage  of 
this  is  that  multiple  structures  within  an  atlas  can  be  deformed  simultaneously.  Forces 
acting  on  the  separate  regions  influence  the  deformations  of  the  other  regions.  The  forces 
acting  on  neighboring  regions  either  compete  and  balance  each  other  out  or  combine  and 
reinforce  each  other.  These  template  models  typically  consist  of  an  external  force  to  drive 
the  template  into  alignment  with  an  image  and  a  regularization  term  based  on  physical 
principles  of  elasticity  or  fluidlike  behavior.  In  these  approaches,  there  is  no  attempt  to 
infer  that  portions  of  the  imaged  object  exhibit  elastic  or  fluidlike  behavior  or  to  model 
actual  developmental  growth  of  particular  structures;  rather,  these  physical  models  act 
as  convenient  analogies  through  which  to  derive  a  regularization  force. 

Although  template  deformation  presents  an  advantage  in  working  with  multiple  struc¬ 
tures,  there  are  some  limitations  with  implementations  described  in  the  literature.  The 
method  presented  by  [24]  bases  the  deformation  on  physical  elastic  constraints.  This 
model  penalizes  large  nonlinear  deformations  in  the  template.  In  the  case  of  brain  image 
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segmentation  as  shown  in  Figure  1.2,  different  brains  may  have  large  nonlinear  differences 
in  anatomical  structure.  To  cope  with  this,  Christensen  et  al.  [25]  developed  a  fluid-based 
model  in  which  the  deformation  is  based  on  a  non-mass-conserving  fluid  continuum  me¬ 
chanics  analogy.  Both  [24]  and  [25],  however,  rely  on  knowledge  of  the  image  contrast 
due  to  the  use  of  a  squared-error  minimization  or  intensity  correlation  between  the  input 
image  and  the  template.  Thus,  either  a  separate  template  for  each  possible  contrast  or 
a  preprocessing  step  to  estimate  contrast  is  required.  Also,  squared-error  minimization 
and  intensity  correlation  techniques  are  based  on  the  image  intensity  gradient,  so  they 
do  not  take  full  advantage  of  the  region  information  within  a  structure. 

Another  disadvantage  with  existing  template  deformation  methods  is  their  limited  use 
of  prior  shape  information.  The  elastic  and  fluid  regularization  terms  have  been  shown  to 
be  equated  to  prior  probabilities  through  a  Gibbs  distribution  [1]  and  can  be  thought  of 
as  capturing  the  Markov  relationship  between  template  elements  [27].  In  practice,  shape 
variations  exhibit  non-Markovian  properties,  but  existing  template  deformation  methods 
are  not  effective  in  capturing  this  type  of  dependency. 


2.4  Information-Based  Template  Matching 

Information-based  metrics  have  been  used  for  alignment  since  at  least  1982,  when 
Mars  et  al.  [28]  used  the  mutual  information  to  estimate  time  delays  in  electroencephalo¬ 
gram  (EEG)  signals.  Information-based  metrics  have  been  introduced  to  measure  rigid 
alignment  of  images  by  Viola  and  Wells  [29],  Wells  et  al.  [30],  Collignon  et  al.  [31],  Maes 
et  al.  [32],  and  Studholme  et  al.  [33].  Measures  proposed  include  joint  entropy,  mutual 
information,  mutual  information  minus  joint  entropy,  and  mutual  information  divided  by 
joint  entropy. 

More  recently,  mutual  information  has  been  investigated  in  a  nonrigid  matching  set¬ 
ting.  The  extension  from  a  rigid-body  alignment  to  non-rigid-body  deformation  is  not 
straightforward.  As  mentioned  in  Section  1.2,  the  degrees  of  freedom  are  much  higher 
for  non-rigid-body  deformations.  This  is  especially  important  when  using  MI,  since  MI 
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assumes  very  little  prior  knowledge  about  the  contrast  and  intensities  in  the  image.  This 
increases  the  possibility  of  converging  to  a  local  extremum. 

Hata  et  al.  [34]  use  mutual  information  between  two  images  to  drive  small  elastic 
deformations  similar  to  [24].  Optimization  is  performed  using  the  stochastic  gradient 
method.  The  gradient  calculation  is  quadratic  in  the  number  of  samples  used  to  approx¬ 
imate  probabilities;  thus,  only  a  subset  of  the  image  pixels  is  used  in  the  calculation. 
Although  large,  nonlinear  deformations  are  not  allowed,  preliminary  studies  for  small, 
linear  deformations  are  encouraging. 

Maintz  et  al.  [35]  propose  a  deformation  which  first  performs  a  rigid-body  registra¬ 
tion  and  then  refines  the  result  by  performing  registration  on  localized  windows.  The 
correspondence  measure  is  based  on  a  probability  maximization  which  can  be  shown  to 
be  similar  to  maximization  of  mutual  information.  Probabilities  estimated  after  rigid 
registration  of  the  image  are  assumed  to  be  adequate  approximations  of  the  probabilities 
after  local  deformation  and  are  not  updated.  Maintz  et  al.  found  a  trade-off  in  the  win¬ 
dow  size;  a  large  window  size  leads  to  inability  to  capture  local  variations,  while  a  small 
window  size  leads  to  violations  of  spatial  connectivity  of  the  image.  Results  demonstrate 
the  over-deformation  problem  described  in  Section  1.2. 

Work  by  Gaens  et  al.  [36]  is  similar  to  that  of  [35]  where  local  transformations  are 
found  by  dividing  images  into  regions  and  translating  these  regions  to  increase  the  local 
similarity  criterion.  Gaens,  however,  updates  the  intensity  probabilities  as  local  de¬ 
formation  occurs.  Additionally,  Gaussian  smoothing  is  performed  to  promote  spatial 
connectivity  of  the  image. 


2.5  MDL-Based  Image  Partitioning 

Methods  based  on  the  minimum  description  length  (MDL)  principle  [37,  38]  have 
been  proposed  for  image  understanding.  The  MDL  principle  is  a  general  approach  in 
scientific  research  in  which  possible  explanations  are  weighted  by  their  complexity.  The 
first  MDL-based  image  partitioning  is  credited  to  Leclerc  [37].  Although  these  methods 
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are  not  usually  considered  deformable  methods,  they  are  related  to  the  research  presented 
here  because  they  contain  an  entropy-related  cost.  The  MDL  principle  asserts  that  we 
choose  the  simplest  explanation  consistent  with  the  observed  data  and  prior  knowledge 
about  the  system.  The  simplicity  or  complexity  of  the  system  can  be  measured  using 
entropy  or  Kolmogorov  complexity  (see  Sections  3.3.1  and  3.3.3). 

The  typical  MDL-based  image  segmentation  procedure  assumes  a  descriptive  lan¬ 
guage  with  which  to  describe  images,  and  then  chooses  the  shortest  description  in  terms 
of  that  language.  The  descriptive  language  used  in  [37]  was  a  low-order  polynomial  de¬ 
scription  of  intensity  variation  within  each  region  and  a  chain-code-like  description  of  the 
region  boundaries.  Other  work  (see,  e.g.,  [38])  has  been  done  to  modify  the  descriptive 
language  or  modify  the  search  scheme  for  the  shortest  description.  A  fundamental  dif¬ 
ference  between  MDL-based  methods  and  the  information-based  methods  described  in 
Section  2.4  is  that  MDL  methods  measure  the  entropy  with  respect  to  a  particular  pre¬ 
selected  descriptive  language,  while  the  information-based  methods  attempt  to  measure 
the  information  or  correspondence  between  two  images. 


2.6  Deformations  Using  Shape  Information 

The  deformable  methods  presented  so  far  rely  on  local  information  from  neighboring 
image  points  in  the  deformation  process.  Other  deformable  methods  are  based  on  more 
global  shape  information.  Particularly  relevant  to  this  thesis  are  methods  introduced  by 
Cootes  et  al.  [39,  40],  Szekely  et  al.  [41],  and  Duta  and  Sonka  [42]  in  which  prior  shape 
models  are  trained  using  eigenshape  analysis  and  are  used  to  constrain  a  snakelike  defor¬ 
mation.  The  prior  shape  models  are  linear  models  and  shape  variations  are  expected  to 
be  within  a  ellipsoidal  region  around  the  average  shape  in  the  shape  space.  As  explained 
in  [40],  in  certain  instances  this  is  not  the  case.  Dividing  a  complex  shape  into  several, 
more  well-behaved,  subshapes  can  be  problematic  because  correlation  between  points 
in  different  subshapes  is  ignored.  Also,  topology  between  subshapes  is  not  necessarily 
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conserved.  A  conflict  could  occur,  for  example,  if  image  points  belong  to  more  than  one 
subshape  [42]. 

Another  parametric  deformable  method  is  that  of  deformable  superquadrics  [43,  44]. 
Superquadrics  are  a  family  of  parametric  shapes  that  are  symmetric  in  nature.  The 
deformable  superquadric  model  can  deform  both  locally  and  globally  by  incorporating 
the  global  shape  parameters  of  a  superellipsoid  with  the  local  degrees  of  freedom  of  a 
membrane  spline  in  a  Lagrangian  dynamics  formulation  [2].  Bardinet  et  al.  [45]  combined 
a  parametric  model  with  a  volumetric  deformation  by  fitting  a  deformable  superquadric 
to  segmented  3-D  cardiac  images  and  then  refining  the  fit  using  a  volumetric  deformation 
technique  known  as  free  form  deformation  (FFD).  An  FFD  can  be  visualized  as  a  rubber¬ 
like  box  in  which  the  object  to  be  deformed  is  embedded.  The  outer  sides  of  the  box  itself 
are  deformed  and  the  deformations  of  the  box  are  transmitted  to  the  embedded  objects. 
Because  the  FFD  is  volumetric,  it  allows  more  than  one  surface  model  to  be  deformed 
simultaneously.  Bardinet  requires  a  known  segmentation  as  an  input  and  then  attempts 
to  compactly  describe  the  segmentation. 

Christensen  et  al.  [46]  and  Joshi  [47]  also  proposed  methods  to  capture  variations  in 
deformations  from  a  template  image.  Like  the  work  of  Bardinet  et  al.,  the  shape  modeling 
is  descriptive  rather  than  generative;  once  a  template  has  been  deformed  using  either  a 
semiautomated  landmark  method  or  elastic  method  without  shape  priors,  the  resulting 
transform  is  then  tested  against  a  hypothesis  to  determine  if  the  anatomy  under  study  is 
normal  or  abnormal.  Statistics  of  the  template  deformation  are  modeled  using  Gaussian 
random  fields  with  independent  elements. 
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CHAPTER  3 


RELEVANT  TOPICS  FROM  PROBABILITY  AND 
INFORMATION  THEORY 


The  purpose  of  this  chapter  is  to  introduce  some  theorems,  concepts,  and  notations 
related  to  probability  and  information  theory  that  will  be  used  later  in  the  thesis.  In- 
depth  treatment  of  these  topics  can  be  found  in  [48,  49,  50,  51,  52,  53]  from  which 
this  section  was  derived.  Standard  proofs  are  not  provided  unless  they  have  specific 
significance  to  the  research  presented  in  the  chapter. 

3.1  Random  Variables 

A  random  variable  (r.v.)  is  a  mathematical  model  for  a  phenomenon  that  behaves 
in  an  unpredictable  manner  from  the  viewpoint  of  the  observer.  For  example,  the  phe¬ 
nomenon  may  be  a  voltage  measurement  across  one  component  in  a  circuit,  the  outcome 
of  a  coin  toss,  the  value  of  a  stock  market  index,  or  a  pixel  intensity  in  an  image.  The 
outcome  of  the  event  or  experiment  may  be  unpredictable  for  several  reasons  including 
computational,  modeling,  or  measurement  limitations.  Such  uncertainty  can  be  conve¬ 
niently  modeled  by  letting  the  outcome  of  such  an  experiment  be  an  r.v. 

Random  variables  can  be  either  discrete  or  continuous.  Loosely  speaking,  when  the 
number  of  possible  outcomes  is  countable,  the  random  variable  is  considered  discrete, 
and  if  the  range  of  X  is  continuous,  the  r.v.  is  considered  continuous.  As  a  discrete 
example,  let  X  represent  the  roll  of  a  die.  By  convention,  upper  case  letters  are  used  to 
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represent  r.v.’s.  The  set  of  all  possible  outcomes  of  the  r.v.  is  called  the  sample  space , 
and  in  this  case  the  sample  space  is  Q,x  =  {1, 2, 3, 4, 5, 6}.  Lower  case  letters,  like  x,  are 
used  to  represent  elements  of  the  sample  space.  A  probability  measure  is  defined  for  all 
possible  subsets  of  the  sample  space.  The  notation  used  here  is  Px(x)  =  Pr(X  =  x), 
which  in  the  die  example,  is  the  probability  that  the  outcome  of  the  roll  of  the  die  turns 
up  x.  For  a  fair  die,  Px(x)  =  |,  Mx  €  Qx  since  the  outcomes  are  equally  likely.  In 
some  cases  the  notation  is  further  reduced  to  P(x)  with  the  hope  that  it  is  clear  from 
the  context  which  random  variable  is  being  referred  to. 

3.1.1  Probability  density  functions  and  probability  mass 
functions 

The  collection  of  probability  measures  on  all  elements  of  the  sample  space  is  called  the 
probability  mass  function  (PMF)  for  discrete  r.v.’s.  A  similar  function  defined  for  con¬ 
tinuous  r.v.’s  is  the  probability  density  function  (PDF).  For  both  continuous  and  discrete 
r.v.’s  a  cumulative  distribution  function  (CDF)  exists  and  is  defined  by 

F(x)  =  Pr(X  <  x)  (3.1) 

For  discrete  r.v.’s  it  is  the  case  that 

5^  P(x)  =  1  (3.2) 

xEtlx 

and  also 

0  <  P{ x)  <  1  (3.3) 

For  continuous  r.v.’s  the  probability  density  function  can  be  defined  as 

(3-4) 

Although,  it  is  true  that 
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it  is  not  the  case  that 


0  <  f{x)  <  1  (3.6) 

since  although  f(x)  must  be  nonnegative  for  all  x,  it  can  be  arbitrarily  large  at  a  point. 

A  discrete  r.v.  that  is  relevant  to  this  thesis  is  the  Bernoulli  r.v.  A  Bernoulli  r.v.  has 
two  possible  outcomes,  say  0  or  1.  Let  -Px(O)  =  p  and  Px(  1)  =  q,  with  p  >  0,  q  >  0, 
p  +  q  =  1.  An  example  of  a  Bernoulli  r.v.  is  a  coin  toss.  If  the  coin  is  “fair”  then 
p  =  q  =  0.5. 

Perhaps  the  most  used  continuous  r.v.  model  is  the  Gaussian  r.v.  The  PDF  of  a 
Gaussian  r.v  is  described  by 

(3.7) 

where  p  and  a  are  called  the  mean  and  standard  deviation  of  the  random  variable.  The 
distribution  described  by  (3.7)  is  named  the  normal  distribution  and  is  abbreviated  by 
N(p,*2). 

3.1.2  The  Gibbs  distribution 

The  Gibbs  distribution  is  a  distribution  based  on  principles  of  statistical  mechanics 
and  probabilities  of  energy  states  at  thermal  equilibrium.  Consider  a  physical  system 
which  can  reside  in  any  of  a  large  number  of  possible  states.  Let  Pi  denote  the  probability 
that  the  system  resides  in  state  i.  Also,  define  £,  as  the  energy  of  the  system  when  it  is 
in  state  i.  A  fundamental  result  from  statistical  mechanics  states  that  when  the  system 
is  in  thermal  equilibrium,  state  i  occurs  with  probability 

Pl  =  hxp{--£r)  (3'8) 

where  T  is  the  absolute  temperature  in  kelvin,  kg  is  Boltzmann’s  constant,  and  Z  is  a 
normalizing  constant  given  by 

z  =  Eexp(-^)  <3-9) 

i 
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The  principle  of  the  Gibbs  distribution  has  been  used  in  signal  processing  and  com¬ 
puter  vision  areas  to  relate  system  energies  to  probabilities.  In  a  less  formal  setting  than 
that  of  (3.8),  probability  density  functions  can  be  written  in  the  form 

f(x)  =  ^exp(^-£(x)Sj  (3.10) 

where  £(x)  is  the  Gibbs  potential  for  random  variable  X.  One  example  of  a  Gibbs 
potential  is  the  Gaussian  potential  given  by 

=  £  (3.H) 

Similarly,  consider  random  vectors  X  and  Y  related  by 

Xi  =  Yi  +  rji  (3.12) 

where  {77*}  is  a  set  of  independent  samples  from  a  Gaussian  process.  Then  the  energy 
associated  with  P(y|x)  is 

£(yix)  =  ^Et^]2  (313) 

i 

3.1.3  Functions  of  random  variables 

Random  variables  can  be  used  in  mathematical  equations  just  as  nonrandom  variables 
are.  The  value  of  an  equation  that  includes  a  random  variable  is  also  random;  an  r.v. 
Y  can  be  defined  by  Y  =  g{X).  If  the  CDF  is  known  for  X ,  the  CDF  for  Y  can  be 
determined  by 

F(y)  =  Pr(Y  <  y)  (3.14) 

=  Pr(g(X)  <  y)  (3.15) 

=  Pr(X  <  g-'(y))  (3.16) 

assuming  g~l  exists.  Probability  mass  functions  can  be  defined  by 

p(y)  =  E  p(x)s(y^{^))  (3-17) 

x£Qx 
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where  <5(-,  •)  is  the  Kronecker  delta  function  with  property 


11  a  =  b, 

(3.18) 

0  otherwise 

For  continuous  r.v.’s  it  is  possible  in  many  cases  to  determine  f(y )  by 

Mv)  =  M  (3-19> 

I  dx  I 

Random  variables  model  single  events  which  have  a  single  outcome.  Phenomena 
that  have  more  than  one  outcome  may  be  modeled  with  random  vectors  or  random 
processes.  Random  vectors  and  random  processes  are  indexed  collections  of  random 
variables.  Random  processes  differ  from  random  vectors  in  that  they  have  an  infinite 
index  which  almost  always  corresponds  to  time.  Random  vectors  are  denoted  in  boldface 
type,  such  as  X. 

3.1.4  Joint  and  conditional  probabilities 

When  dealing  with  more  than  one  r.v.,  joint  and  conditional  probabilities  exist.  The 
joint  probability  of  X  and  Y  is  the  probability  of  two  events  both  occurring,  written  by 

Pxy(x, y )  =  Pr{X  =  ®  and  Y  =  y)  (3.20) 

Two  random  variables  are  independent  when 

PXY(x,y)  =  Px(x)PY(y )  (3.21) 

If  (3.21)  does  not  hold,  then  the  two  random  variables  are  dependent;  that  is,  knowledge  of 
the  outcome  of  one  random  variable  will  influence  the  probability  of  the  other  occurring. 
The  probability  of  X  given  knowledge  of  the  outcome  of  Y  is  a  conditional  probability 
and  is  denoted  by 

Px\y{x\y)  =  Pr(X  =  x  given  Y  =  y)  (3.22) 
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In  all  cases 


Pxr(x,y)  =  Px\Y(x\y)PY(y)  (3.23) 

Probabilities  Px(x)  and  Py(y)  themselves  are  called  marginal  probabilities  and  can  be 
derived  from  the  joint  probability  using 

Px(x)  =  £  Pxy(x,v)  (3.24) 

yeQy 

Mv)  =  Y,  pXY(x,y)  (3.25) 

x€Qx 

3.1.5  Simple  statistics  of  random  variables 

A  statistic  is  a  deterministic  value  computed  about  a  random  variable  which  lends  to 
an  understanding  of  its  uncertain  behavior.  An  often  used  statistic  is  the  expected  value 
or  mean  of  an  r.v.  The  expected  value  Ex(X)  of  a  discrete  r.v.  is  defined  as 

Ex(X)  =  XiPx(xi)  (3.26) 

Other  notations  used  for  Ex(X)  are  px  or  E(X). 

Intuitively  E(X)  is  the  average  value  that  samples  drawn  from  Px  will  take  on.  In 
fact,  it  can  be  proven  that  as  the  number  of  samples  grows,  the  average  value  of  the 
samples  will  converge  to  the  mean  of  X.  More  formally,  consider  an  experiment  for 
which  outcomes  follow  a  probability  law  Px(x).  Let  Xiyi  =  1,...  ,n  denote  random 
variables  associated  with  n  outcomes  of  the  experiment,  then 

1  71 

E(X)  =  lim  -Vxi  (3.27) 

n->oo  n 

i— 1 

Equation  (3.27)  is  a  statement  of  the  law  of  large  numbers  (see  [52],  p.  357). 

For  a  limited  number  of  samples,  the  mean  of  the  r.v.  can  be  estimated  by 

h  =  -f'xi  (3.28) 

n  * — ' 
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Another  statistic  is  the  variance  of  a  random  variable.  It  is  defined  mathematically 

by 

VAR(X)  =  Ex[(X  -  E(X))2}  =  Ex(X 2)  -  E2{X)  (3.29) 

The  standard  deviation  of  a  random  variable  is  another  popular  statistic.  It  is  simply  the 
square  root  of  the  variance. 

Much  of  the  utility  of  an  r.v.  is  dependent  on  knowledge  of  an  associated  PDF  or 
PMF.  However,  in  many  cases  the  probability  distributions  are  unknown  and  must  be 
estimated  from  measured  data.  One  popular  estimation  technique  is  to  assume  the  r.v. 
follows  a  probability  law  that  is  a  function  of  a  few  statistical  parameters  such  as  the 
mean  and  variance.  These  statistics  are  then  estimated  from  experimental  samples.  For 
example,  the  noise  in  image  data  is  often  assumed  to  conform  to  a  normal  distribution. 
The  distribution  is  then  characterized  by  its  mean  and  variance. 

3.1.6  Central  limit  theorem 

One  justification  for  the  use  of  a  normal  probability  distribution  function  in  modeling 
random  variables  is  the  central  limit  theorem.  Basically,  the  central  limit  theorem  says 
that  the  distribution  of  the  normalized  sum  of  a  large  number  of  mutually  independent 
random  variables  with  zero  means  and  finite  variances  tends  to  the  normal  probability 
distribution  provided  that  the  individual  variances  are  small  enough.  The  following  is  a 
more  formal  statement  of  this  important  theorem. 

Theorem  3.1  (Central  Limit  Theorem,  [52],  p.  213)  Let  X\,...  , Xn  be  n  mutu¬ 
ally  independent  (scalar)  r.v.’s  with  CDFs  Fi(xi),  F2(x2), . . .  ,  Fn(xn),  respectively,  such 
that 

E[Xk ]  =  0,  Var[Xk]  =  a\ 

and  let 

2  2  ,  I  2 

Sn  ~  °\  +  '  •  •  +  an 
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If  for  a  given  e  >  0  and  n  sufficiently  large,  the  ak  satisfy 


ak  <esl  k  =  1, . . .  ,n 


then  the  normalized  sum 


Y  =  (Xi  +  . . .  +  Xn )/ s 


converges  to  the  standard  normal  PDF,  that  is,  lim^oo  fy(y)  =  N(0, 1). 


3.2  Parzen  Window  Density  Estimation 

A  popular,  nonparametric  approach  to  probability  density  estimation  is  the  Parzen 
window  technique.  This  technique  is  much  more  flexible  than  parametric  methods  since 
Parzen  estimation  requires  only  that  the  density  being  estimated  be  smooth.  The  general 
form  of  the  estimate  is 

1  N 

fx(x)  =  -'£R(x-xi)  (3.30) 

i= 1 

where  fx{x)  is  an  estimate  of  the  probability  density  at  point  x,  the  x,  are  the  results 
of  N  samples  drawn  independently  from  the  distribution  being  estimated,  and  R(x)  is  a 
windowing  or  smoothing  function.  Equivalently,  (3.30)  can  be  rewritten  as 

N 

nr  _  nr  - 

(3.31) 


/*<*>  = 

2=1 


to  explicitly  show  the  dependence  of  fx(x)  on  the  width  bx  of  the  windowing  function. 
The  estimate  is  assured  to  be  a  legitimate  density  function  if 

R(x)  >  0,  \/x 


and 

J  R(x)dx  =  1 
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The  most  common  windowing  functions  are  symmetric  about  the  origin,  axe  uni- 
modal,  and  fall  off  quickly  to  zero.  Very  common  windowing  functions  are  the  boxcar 
(rectangular)  function  and  the  Gaussian  function.  The  Parzen  density  estimator  essen¬ 
tially  computes  a  local  average  of  the  samples.  Notice  that  if  R  is  symmetric  about  the 
origin,  the  estimate  at  a  point  can  be  calculated  as  a  weighted  sum  over  all  sample  points, 
where  the  weighting  is  determined  by  the  window  function  centered  at  the  query  point. 

The  technique  relies  on  the  fact  that  the  probability  Pj  that  a  vector  will  fall  in  a 
region  Xj  is  given  by 

Pj  =  [  f(x)dx  (3.32) 

Jx j 

Thus  the  pj  represent  a  smoothed  and  discretized  version  of  f(x).  If  we  assume  that  f(x) 
is  continuous  and  that  Xj  is  small  enough  that  f(x)  does  not  vary  appreciably  within  it, 
then 


Pj  ~  f(xj)V 


(3.33) 


where  Xj  is  a  point  in  Xj  and  V  is  the  volume  of  Xj .  Thus,  the  only  theoretical  restriction 
on  f(x)  is  that  it  is  smooth. 

The  Parzen  estimate  can  also  be  thought  of  as  an  estimate  of  an  expectation: 

1  N 

f(x o)  =  Jf  X,  R(x°  -  **)  K  -  x )1  (3-34) 

i=  1 

As  N  becomes  large  we  have 


lim  f(x0)  =  E[R(x o  -  X)] 

N-+oo 

/oo 

R(x  o  —  x)f(x)dx 

-oo 

=  R(x o)  *  f(x o) 


(3.35) 

(3.36) 

(3.37) 


In  the  limit  f(x)  converges  to  a  convolution  between  R  and  the  true  f(x),  so  f(x)  con¬ 
verges  to  f(x)  if  and  only  if  f(x)  =  R(x)  *  f(x).  When  this  equality  holds,  f(x)  becomes 
an  unbiased  estimator.  For  an  infinite  number  of  samples  this  can  be  accomplished  by 
defining  R(x)  such  that  it  behaves  like  a  Dirac  delta  function  in  the  limit. 
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For  a  finite  number  of  samples,  letting  R(x )  be  a  delta  function  will  usually  give  a 
useless  density  estimate.  However,  under  some  conditions  on  f(x),  the  equality  f(x)  = 
R(x)  *  f(x)  still  holds  for  R(x)  other  than  a  delta  function.  When  f(x)  is  frequency- 
band-limited  and  R(x)  is  a  low-pass  filter  such  that  f(x)  can  pass  undistorted  through 
R(x),  the  Parzen  estimate  will  be  unbiased.  Although  such  a  combination  is  rarely  the 
strict  case,  it  is  approximately  true  in  many  circumstances.  For  example,  many  imaging 
processes  include  an  additive  noise  term.  The  samples  used  in  the  estimation  can  be 
expressed  in  terms  of  another  random  variable  plus  uncorrelated  noise: 

Xi  =  Yi  +  77  (3-38) 


The  density  of  Xi  is  equal  to  the  density  of  Yi  convolved  with  the  density  of  the  noise.  For 
many  noise  models  including  Gaussian  and  chi-squared,  this  has  the  effect  of  smoothing 
the  distribution  of  Y  such  that  A  is  a  low-pass  version  of  F’s  distribution.  In  this  case, 
a  good  Parzen  window  is  one  with  the  same  bandwidth  as  the  noise  density  function. 

Because  the  estimate  must  be  calculated  from  a  finite  number  of  samples,  it  is  in¬ 
structive  to  know  how  the  estimate  varies  with  the  number  of  samples.  The  variance  of 
f(x)  is 


VAR(/(x))  =  E 
=  E 


(/>)  -  £/») 

E  R(*  ~  R(x  -  *’•)) 

^  R(x  -  Xi)  -  ER(x  - 

=  ^  ( ER2(x  -  Xi)  -  E2R(x  -  Xi)) 

=  jj  j  R2(x  -  Xi)f(xi)dxi  -  j  R(x-Xi)f(xi)d: 


(3.39) 

(3.40) 

(3.41) 

(3.42) 

(3.43) 


Equations  (3.39)  and  (3.40)  come  from  (3.29)  and  (3.30).  Equation  (3.42)  follows  from  the 
fact  that  the  variance  of  the  sum  of  independent  variables  is  the  sum  of  the  variances.  The 
result  in  (3.43)  shows  that  the  variance  of  the  estimate  varies  inversely  with  the  number 
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of  samples,  and  will  converge  to  zero  if  f  R2(x  —  X{)f(xi)dxl  and  f  R(x  —  Xi)f(xi)d,Xi  are 
bounded. 

Perhaps  a  more  useful  measure  of  the  utility  of  estimate  f(x)  is  the  mean  squared 
error  from  the  actual  f(x)  rather  than  from  its  own  mean.  This  can  be  written  as  [54] 

E  |  /(*)  -  S(x)  P=  VAR(/M)+  |  Ef(x)  -  }(x)  p  (3.44) 


So  for  unbiased  estimates  of  f(x)  the  mean  squared  error  is  equal  to  the  variance  of  the 
estimator  itself. 

It  can  further  be  shown  [54,  55]  that  the  estimate  f(x)  at  a  particular  point  x  asymp¬ 
totically  approaches  a  normal  distribution  with  mean  f(x)  and  variance  f(x)/(NbN)  for 
many  typical  windowing  functions  including  boxcar,  Gaussian,  and  triangle.  The  proof 
is  shown  below  for  the  1-D  case  using  a  boxcar  function. 

Let 


if \e\  <  0.5, 

otherwise 


(3.45) 


The  probability  that  the  ith  sample  falls  inside  the  windowing  function  centered  at  x,  for 
example,  follows  the  Bernoulli  probability  law.  Let  Yxj  define  these  Bernoulli  random 
variables;  then, 


i— 1 


(3.46) 


The  probability  of  success  of  the  Bernoulli  trial  (that  x  —  bnj 2  <  x{  <  x  +  BN/2)  is 
defined  by 


Pr(Yx,i  =  1  )  =  Fx(x  +  bN/ 2)  -  Fx{x  -  bN/ 2)  *  f(x)bN 


where  Fx  is  the  CDF  of  X  and  f(x)  remains  approximately  constant  over  the  width  of 
the  window.  If  N,  x,  and  bx  are  selected  such  that  N f(x)bN  —  N f2{x)b2N  >  0,  then 
another  random  variable  Z  can  be  defined  by 

E  (Y*,i  ~  f(x)bN) 


z  = 


\/ N f{x)bN  -  Nf(x)bl 


(3.47) 


N 
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By  the  central  limit  theorem,  the  distribution  of  Z  approaches  iV(0, 1)  as  N  — >  0.  Then 
the  distribution  of  f(x)  is  also  normal  with  mean  f(x)  and  variance  f(x)/ {Nbpj)  if  6jv  — ^  0 
and  Nbw  — *  oo  as  N  -4  0. 

For  a  general  windowing  function,  the  estimate  is  asymptotically  normal  with  mean 
and  variance  given  by 

Ef(x )  «  f(x)  (3.48) 

and 

VAR(/(x))  «  j  Rl(v)dv 

For  2-D  estimations, 

Ef{x,y)&  f{x,y) 

and 

VAR (f{x,y))  »  J  Rl{vuV2)dvxdv2  (3.51) 

When  considering  the  shape  of  the  Parzen  window,  one  should  refer  to  previous  work 
done  by  Epanechnikov  [56]  and  Rosenblatt  [54]  in  which  window  shapes  were  studied. 
The  conclusion  drawn  from  [54]  is  that  although  an  optimal  window  can  be  derived, 
which  is  expected  to  reduce  the  variance  most  rapidly  with  increasing  sample  size,  many 
nonoptimal  windows  achieve  approximately  the  same  rate  of  decrease. 

To  illustrate  the  insensitivity  of  the  estimate  to  window  shape,  samples  were  drawn 
from  a  mixture  of  three  Gaussians,  followed  by  the  addition  of  Gaussian  noise  with 
standard  deviation  of  3.  Figure  3.1  shows  the  PDF  from  which  the  samples  were  drawn. 
The  values  of  the  samples  were  then  quantized  into  500  bins.  The  density  was  estimated 
from  these  samples  using  various  Parzen  windows.  Following  the  notation  of  (3.31),  let 
the  width  of  the  Parzen  window  be  defined  by  b n  where 

(3'52) 


(3.49) 


(3.50) 
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Figure  3.1  Plot  of  the  PDF  from  which  samples  were  drawn  for  tests  corresponding  to 
Figures  3.2,  3.3,  and  3.4. 

Figure  3.2  shows  the  squared  error  between  the  estimated  and  the  true  distributions 
as  6jv  varied.  The  Gaussian  and  rectangular  (boxcar)  windows  worked  best  and  had 
approximately  the  same  minimum.  The  triangle  window  was  only  slightly  less  effective.  It 
is  possible  that  the  Gaussian  shape  worked  best  due  to  the  Gaussian  nature  of  underlying 
distribution  and  noise. 

Figures  3.3  and  3.4  illustrate  the  change  in  estimation  accuracy  as  a  function  of  the 
number  of  samples  N.  Figure  3.3  shows  the  estimation  error  for  a  Gaussian  Parzen 
window  as  a  function  of  the  width  of  the  Gaussian  pulse  when  500  samples  were  drawn 
from  the  distribution  shown  in  Figure  3.1.  The  number  of  samples  was  equivalent  to  the 
quantization  levels  of  the  samples.  Figure  3.4  shows  the  equivalent  experiment  when  5000 
samples  are  used.  By  contrasting  Figures  3.3  and  3.4  it  is  evident  that  as  the  number  of 
samples  becomes  larger,  the  optimal  window  width  becomes  smaller.  Less  smoothing  is 
necessary  since  the  variance  of  the  estimate  at  a  specific  x  will  become  smaller. 

The  preceding  experiments  show  that  for  a  relatively  large  number  of  samples  com¬ 
pared,  the  probability  estimate  is  quite  insensitive  to  both  window  width  and  window 
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Figure  3.2  Plot  of  the  error  between  the  estimated  and  actual  PDF  as  a  function 
of  Parzen  window  width  for  different  window  shapes.  Window  shapes  shown  are 
Gaussian,  rectangle,  triangle,  and  the  “optimal”  window  given  in  [54]. 


Figure  3.3  Error  between  the  estimated  and  actual  PDFs  as  a  function  of  window  width 
for  a  Gaussian  window  and  500  samples. 


29 


Figure  3.4  Error  between  the  estimated  and  actual  PDFs  as  a  function  of  window  width 
for  a  Gaussian  window  and  5000  samples. 


x  10~3 


Figure  3.5  Plots  of  the  actual  and  estimated  distributions  using  a  Gaussian  Parzen 
window  with  several  widths. 
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shape.  Throughout  this  thesis  a  Gaussian  windowing  function  with  a  standard  deviation 
of  5  is  used. 


3.3  Information 

The  foundations  of  what  is  known  today  as  information  theory  were  developed  by 
Claude  Shannon  in  1948  in  direct  response  to  the  need  for  electrical  engineers  to  design 
communication  systems  that  were  both  efficient  and  reliable  [51].  Information  theory, 
however,  has  proved  useful  in  far  more  arenas  than  communications,  such  as  statistical 
physics,  computer  science,  statistical  inference  [48],  training  of  neural  networks  [51],  and 
image  processing  [30,  32,  57]. 

The  information  of  an  event  is  the  amount  of  knowledge  gained  by  the  occurrence  of 
that  event.  Consider  a  discrete  r.v.  Y  with  sample  space  fly  C  71.  The  outcome  of  Y 
is  y ,  where  y  can  take  any  value  in  fly  with  probability  P(y).  The  information  gained 
upon  observing  the  outcome  of  Y  is 

%>  =  'OS  (f4))  (3'53) 

where  log  is  usually  taken  to  be  base  2.  The  information  gained  by  the  occurrence 
of  a  likely  event  is  small  since  the  outcome  was  in  some  sense  “expected”;  however,  if 
an  unlikely  event  occurs,  it  provides  much  more  information.  Thus  I(y)  measures  the 
“surprise”  involved  in  the  occurrence  of  y. 

3.3.1  Entropy 

Another  useful  measure  is  the  expected  information  or  expected  surprise  associated 
with  a  random  variable.  Let 

H(Y)  =  £[/«] 

=  -  E  •p(!,)logP(y)  (3.54) 
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The  quantity  H(Y)  is  the  entropy  of  the  random  variable  Y.  Alternate  notations  are 
H(Py )  and  Hy(P),  which  emphasize  the  fact  that  entropy  measures  are  based  on  the 
distribution  of  the  probability  function  rather  than  the  actual  outcome  of  the  random 
variable. 

Entropy  is  also  defined  for  pairs  of  random  variables.  The  joint  entropy  of  two  random 
variables  is 


H(X,Y)  =  -  £  £  P(x,y)logP(x,y)  (3.55) 

VccGOx  VyGOy 

and  the  conditional  entropy  is 

H(X\Y)  =  -  Y,  Y  P(x,y)l°Sp(x\y)  (3.56) 

The  joint  and  conditional  entropies  are  related  by 

H(X,Y )  =  H(X)  +  H(Y  |  X)  (3.57) 

Applying  this  relationship  repeatedly  results  in  the  chain  rule  for  entropy: 

n 

H(X1,X2,...,Xn)  =  £ff(A*  \Xi-u->Xi)  (3-58) 

i— 1 


3.3.2  Mutual  information 


The  relative  entropy  or  Kullback-Leibler  distance  between  two  probability  mass  func¬ 
tions  P(y)  and  Q(y)  is 


wiq)  =  £pG/)1oSq§5 


(3.59) 


The  relative  entropy  D(P\\Q)  is  a  measure  of  the  inefficiency  of  assuming  that  the  dis¬ 
tribution  is  Q  when  the  true  distribution  is  P. 

Using  relative  entropy,  a  related  concept  called  mutual  information  can  be  defined. 
The  mutual  information  of  two  random  variables  I(X ;  F)  is  the  relative  entropy  be¬ 
tween  the  joint  distribution  P(x,y)  and  the  joint  distribution  assuming  independence, 


32 


P(x)P(y). 


I(X-,Y)  =  D(P(x,y)\\P(x)P(y)) 

xexyey  K  ’  yy' 


(3.60) 


3.3.3  Kolmogorov  complexity 

Entropy  is  also  related  to  Kolmogorov  complexity.  Kolmogorov  defined  the  algorith¬ 
mic  (descriptive)  complexity  of  an  object  to  be  the  length  of  the  shortest  binary  computer 
program  that  describes  the  object.  Assuming  the  computer  already  knows  the  size  l(y)  of 
an  object  y ,  then  the  conditional  Kolmogorov  complexity  can  be  defined  as  the  minimum 
length  of  program  p  such  that  a  computer  U  executing  p  and  knowing  l(y)  will  produce 
output  y.  Mathematically  this  is  given  by 


K(y  |  l(y))  = 


min  l(p) 

p:U(p,l(y))=y 


(3.61) 


Under  certain  conditions,  the  expected  value  of  the  conditional  Kolmogorov  complexity 
of  a  random  sequence  converges  to  the  entropy  of  the  sequence  ([48],  p.  153). 
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CHAPTER  4 


TEMPLATE  ALIGNMENT  AND  BODY  FORCE 


The  main  topic  of  this  chapter  is  the  body  force  which  drives  the  template  into 
alignment  with  an  image.  The  chapter  starts  with  a  look  at  the  use  of  MI  as  an  image 
alignment  metric  based  on  work  by  [30,  32,  53,  58],  and  then  extends  the  study  of  MI  to 
the  more  specific  case  of  non-rigid-body  registration. 

4.1  Alignment  Based  on  Mutual  Information 

Consider  an  imaging  process  which  provides  some  information  about  the  underlying 
structure  s(x)  of  an  object  in  the  form  of  an  image  d(x).  For  2-D  images,  x  =  (x,y)  and 
takes  on  values  in  X,  which  without  loss  of  generality  can  be  defined  as  the  unit  square. 
Each  element  of  d  takes  on  a  value  between  0  and  Nj  —  1.  For  grayscale  images  Nj  is 
usually  256.  Each  element  of  s  takes  on  a  value  between  0  and  Na  —  1  where  Na  is  the 
number  of  different  materials/structures  in  the  scene.  The  imaging  process  (see  Figure 
4.1)  can  be  formulated  by 

d(x)  =  C{s{x),(j))  +  r]  (4.1) 

where  0  is  a  single  vector  representing  exogenous  factors  such  as  lighting,  temperature, 
etc.  The  imaging  function  C  defines  the  contrast  of  the  image. 

To  take  advantage  of  the  discrete  nature  of  the  image  and  template,  an  alternate 
notation  for  d,  s,  and  a  can  be  used.  For  an  Nx  by  Ny  element  image,  let  Np  =  NxNy. 
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Figure  4.1  A  general  imaging  model.  The  structure  of  an  underlying  scene  is  converted 
to  an  image  d(x)  by  the  imaging  process. 


Also  let  i  refer  to  any  of  Np  specific  locations  which  correspond  to  an  equally  spaced 
grid  throughout  X.  Then  d,  s,  and  a  can  be  considered  vectors  where,  for  example, 
d  =  (d(i)  :  t  =  1,2,  ...,NP}. 

The  deformable  template  approach  to  image  understanding  is  to  develop  a  coordinate 
displacement  field  u  :  X  — »  X  such  that  a  deformed  template  a(x)  =  a(x  —  u(x)) 
closely  approximates  s.  Due  to  noise,  other  uncertainties,  and  irreversible  effects  of  the 
imaging  process,  it  may  be  impossible  to  know  for  sure  the  exact  nature  of  the  scene 
acquired;  however,  it  is  assumed  that  d  contains  information  about  s.  The  amount  of 
information  that  can  be  inferred  about  s  from  d  depends  on  the  nature  of  C,  the  amount 
of  uncertainties  and  noise,  as  well  as  prior  knowledge  about  the  imaging  process  and  the 
scene. 

For  a  number  of  practical  problems,  C  is  not  known.  Within  this  thesis,  alignment 
measures  which  do  not  depend  on  knowledge  of  C  are  called  contrast-tolerant  or  contrast- 
invariant  metrics.  C  may  not  be  known  for  one  of  several  reasons: 

•  C  may  be  difficult  to  model. 

•  The  important  exogenous  factors  are  unknown  or  difficult  to  measure. 

•  Perhaps  it  is  unknown  a  priori  which  of  many  imaging  processes  was  used  to  gen¬ 
erate  d. 

The  method  presented  herein  assumes  no  specific  knowledge  about  the  nature  of  C , 
only  that  the  information  mutual  to  both  a  and  d  is  maximal  when  correctly  aligned. 
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4.1.1  Squared-error  and  correlation  alignment  metrics 

To  better  motivate  the  need  for  a  contrast-tolerant  metric,  let  us  first  examine  other 
non-contrast-tolerant  metrics  that  have  been  used  to  measure  image  alignment. 

For  the  case  that  C ,  a,  and  0  are  known,  an  alignment  metric  that  is  optimal  in  a 
probabilistic  sense  is 

log  P(d  |  u)  =  log  P( d  |  u,  C,  a,  0)  (4.2) 

=  logP(77  =  d-C'(u(a),q))  (4.3) 

The  second  equality  follows  from  the  fact  that  if  C,  a,  u,  and  0  are  all  given,  then  the 
only  randomness  of  d  is  found  in  rj.  The  function  P(d  |  u)  is  called  the  likelihood  of  d. 
If  rj  is  from  an  additive  white  Gaussian  noise  process  and  pixels  of  d  are  conditionally 
independent  given  C,  a,  u,  and  0,  then  (4.3)  reduces  to 

Np 

iogP(d  |  u)  =  -e\  -  Q2  (d(*) _  &(*(*)»  ^))2  (4-4) 

i=0 

Constants  ei  and  f?2  are  computed  from  the  variance  of  the  noise  and  are  not  important 
to  the  maximization  of  the  log  likelihood.  Because  a  is  a  function  of  u,  the  optimal 
displacement  field  u*  is  found  by 

Np 

u*  =  argmin^  (d(i)  -  C{a{i),4>))2  (4.5) 

U  i=0 

which  is  a  squared-error  minimization  or  least  squares  approach. 

For  large  Np,  squared-error  minimization  is  related  to  a  correlation.  The  summation 
of  (4.5)  is  proportional  to  an  expectation 

Np 

Y.  («*(»)  -  C( S(i),  <P))2  oc  [D  -  C(A,  0)] 2  (4.6) 

i= 0 

=  E[D2]  -  2 E[DC(A,(f>)]  +E[C2(A,4>)]  (4.7) 

where  r.v.’s  D  and  A  are  introduced  to  model  uncertainty  in  the  d(i )  and  a(*).  The 
middle  term  of  (4.7)  is  the  correlation  between  D  and  A.  Correlation  has  been  used  as 
an  alignment  measure  in  the  literature  [24]. 
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Figure  4.2  A  one-dimensional  structure  and  the  resulting  image  based  on  simply  adding 
noise. 

Examples  using  squared-error  alignment 

Consider  the  following  simple  one-dimensional  examples,  where  x  G  [0,1].  Allowable 
transformations  consist  of  all  circular  shifts:  u  G  [0, 1].  Let  C  be  the  identity  function 
such  that 

d(x )  =  s(x)  +  rj  (4.8) 

where  r)  is  white  Gaussian  noise.  Figure  4.2  shows  an  example  s  and  a  possible  resulting 
d  based  on  this  C.  Figure  4.3  shows  the  error  surface  for  all  possible  shifts  of  s  based  on 
(4.5).  The  minimum  squared-error  solution  is  correctly  located  at  zero  shift. 

A  more  complicated  contrast  function  would  be 

C(s)  =  70  -  h rig)!  (4,9) 

Figure  4.4  shows  s  and  d  for  the  imaging  function  of  (4.9).  Figure  4.5  demonstrates  that 
if  C  is  known,  the  correct  alignment  can  be  found.  However,  if  the  wrong  C  is  assumed 
when  finding  the  alignment,  the  resulting  transform  estimate  is  likely  to  be  far  off.  For 
example,  assuming  d  was  created  from  s  using  the  identity  imaging  function  leads  to  the 
error  curve  of  Figure  4.6.  The  minimum  error  does  not  lie  close  to  zero  shift. 
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Figure  4.3  Alignment  error  for  all  possible  shifts  of  the  structure  in  Figure  4.2.  Error 
is  based  on  the  sum  of  squared  intensity  differences. 


Figure  4.4  A  one-dimensional  structure  and  the  resulting  image  based  on  the  imaging 
function  of  (4.9). 
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Figure  4.5  Alignment  error  for  all  possible  shifts  of  the  structure  in  Figure  4.4.  Error 
is  based  on  the  sum  of  squared  intensity  differences  with  the  known  imaging  function  of 
(4.9). 


Figure  4.6  Alignment  error  for  the  structure  in  Figure  4.4  when  the  imaging  function 
was  erroneously  assumed  to  be  the  identity  function. 
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4.1.2  Mutual  information  for  alignment 


An  alternate  measure  of  image  alignment  is  the  estimated  mutual  information  between 
the  image  and  deformed  template.  MI  can  be  viewed  as  a  nonparametric  approach  to 
alignment  because  no  model  is  assumed  for  C.  Instead,  it  is  assumed  that  the  MI 
between  the  study  image  and  the  template  is  maximal  when  they  are  correctly  aligned. 
Implications  of  this  assumption  are  studied  in  Section  4.5.1.  As  mentioned  in  Section 
3.3.1,  entropy  measures  are  not  based  directly  on  intensity  values;  rather,  the  graylevels 
are  treated  as  symbols  and  the  probabilistic  distribution  of  the  symbols  determines  the 
entropy  and  MI.  So,  MI  is  a  more  natural  criterion  for  matching  than  the  squared-error 
approach  when  C  is  not  well  defined. 

Consider  two  images,  d  and  a.  The  mutual  information  between  d  and  a  is  calculated 
by  introducing  random  variables  D  and  A  with  sample  spaces  d  E  Q,d  =  { 0, . . .  ,  A^}  and 

a  e  =  {0, . . .  ,  Na},  and  probabilities  empirically  determined  by 

Np  Np 

PDA(d ,  R2(d(m )  -  5(n)  -  5)  (4-10) 

P  m— 1  n— 1 

i  "* 

P  m= 1 
1  Np 

pA(d)  Rl(a(n)  -  a)  (4.12) 

P  71=1 

Here  R1  and  R2  are  appropriate  ID  and  2-D  Parzen  windowing  functions.  The  MI 
between  the  images  is  then 


Na  1  JV £  1  J-y  /  *  ~  \ 

/P;^)=E  £iW<i.8)iog  ■Di(d’a) 

a= 0  d=  0 


(4.13) 


PD(d)Pi(a) 

For  this  thesis,  notation  is  relaxed  slightly  so  that  /(d;a)  =  I(D,A )  signifies  the  MI 
between  images  d  and  a  using  (4.10)  through  (4.13)  and  implicit  r.v.’s  D  and  A.  Similar 
notation  for  the  entropy  of  images  is  also  used  to  allow  H( d,  a)  =  H(D,A),  H( d)  = 
H(D),  and  H( a)  =  H(A). 

In  theory,  the  estimated  probabilities  in  (4.10)  through  (4.12)  are  continuous  PDFs 
when  the  windowing  functions  are  continuous  functions.  However,  closed  form  solutions 


40 


-0.02 


Figure  4.7  Mutual  information  alignment  error  curve  for  the  example  structure  and 
image  shown  in  in  Figure  4.2. 

for  entropy  and  MI  based  on  continuous  distributions  do  not  exist  in  general.  Although 
there  is  a  closed  form  solution  for  Gaussian  densities,  mixtures  of  Gaussians  must  be 
calculated  empirically.  Because  of  this  and  the  discrete  nature  of  the  images,  there  is 
no  real  advantage  in  maintaining  continuous  probabilities.  Thus,  for  convenience  the 
probabilities  are  quantized  and  expressed  in  a  discrete  form. 

MI  can  be  used  to  determine  the  alignment  of  the  simple  ID  examples  of  Section 
4.1.1.  Figures  4.7  and  4.8  demonstrate  the  ability  of  MI  to  find  the  correct  alignment  for 
imaging  functions  illustrated  in  Figures  4.2  and  4.4,  respectively.  MI  locates  the  correct 
alignment  in  both  cases  without  knowledge  of  the  imaging  function. 

The  gain  in  generality  of  the  metric  is  not  without  cost.  By  comparing  Figures  4.7 
and  4.8  with  4.3  and  4.5,  it  is  evident  that  the  region  of  attraction  to  the  minimum  (the 
width  of  the  main  downward  spike)  is  smaller  for  MI. 
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Figure  4.8  Mutual  information  alignment  error  curve  for  the  structure  and  image  shown 
in  Figure  4.4. 

4.1.3  Parzen  windowing  for  MI 

In  many  implementations,  the  empirical  MI  calculation  is  done  using  Kronecker  delta 
functions  for  the  Parzen  windows  used  in  (4.10)  -  (4.12).  The  probabilities  are  then  just 
the  normalized  histograms  of  the  images  and  the  joint  histogram  of  the  two  images.  This 
simplifies  analysis,  implementation,  and  understanding  of  the  deformation  procedure, 
but  other  choices  of  Parzen  windows  may  lead  to  better  results.  The  major  reason  for 
this  is  that  the  MI  approach  treats  the  image  graylevels  as  symbols.  There  is  no  relative 
distance  between  symbols.  In  symbol  space,  symbol  1  is  not  closer  to  (or  farther  from) 
symbol  2  than  it  is  to  (from)  100,  but  in  intensity  space  1  and  2  are  very  close,  and  1 
and  100  are  clearly  distinguishable. 

Particular  situations  in  which  a  relative  distance  may  be  useful  are  the  cases  of  ad¬ 
ditive  noise  and  partial  volume  effects.  With  additive  noise  and  only  a  finite  number  of 
samples,  the  probability  distributions  may  not  be  sufficiently  estimated  to  show  the  true 
relationship  between  a  symbol  and  the  same  symbol  sightly  corrupted  by  white  noise. 
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Figure  4.9  Visual  explanation  of  the  partial  volume  effect.  Due  to  finite  resolution  of 
the  imaging  process,  an  image  pixel  may  contain  information  from  more  than  one  region. 
(Adapted  from  [59].) 

Partial  volume  effects  are  the  result  of  a  discrete  image  element  lying  on  a  region 
boundary.  The  intensity  of  the  image  contains  information  from  two  different  structures. 
This  is  illustrated  in  Figure  4.9.  As  a  2-D  image  slices  a  3-D  object,  pixels  overlap  more 
than  one  region. 

Parzen  window  estimation  may  reduce  noise  and  partial  volume  effects.  The  smooth¬ 
ing  property  of  the  Parzen  window  imposes  a  relationship  between  similar  graylevels. 
Figure  4.10  shows  a  1-D  image  with  the  horizontal  axis  showing  the  image  position  index 
and  the  vertical  axis  showing  the  intensity  (0-255).  Assume  that  the  gradual  gradient 
from  low  to  high  intensity  is  a  result  of  partial  volume  effects.  The  template  is  a  two  val¬ 
ued  image  which  can  be  shifted  along  the  horizontal  axis  to  divide  the  study  image  into 
two  parts.  Using  MI  with  normalized  histograms  leads  to  the  division  shown  in  Figure 
4.10(a)  with  about  half  of  the  distinct  graylevels  on  each  side  of  the  line.  Figure  4.10(b) 
shows  the  same  experiment  but  using  a  Gaussian  window  to  estimate  the  probabilities. 
The  result  is  that  the  image  is  divided  so  that  all  intensities  above  126  are  on  the  right 
side  of  the  line.  Based  on  partial  volume  effects,  the  division  of  Figure  4.10(b)  is  much 
more  accurate  than  that  of  Figure  4.10(a). 
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Figure  4.10  1-D  image  example  showing  the  utility  of  Parzen  window  probability  esti¬ 
mation  in  the  presence  of  partial  volume  effects.  The  curved  line  is  a  1-D  image  showing 
partial  volume  effects  in  the  transition  from  intensity  0  to  255.  Probability  estimation 
based  on  histogram  normalization  leads  to  the  division  shown  in  (a).  Parzen  window 
estimation  leads  to  the  more  accurate  division  shown  in  (b). 

4.2  Using  the  MI  to  Find  a  Body  Force 

For  the  simple  1-D  examples  above,  locating  the  correct  alignment  was  a  simple  prob¬ 
lem  of  computing  the  MI  for  all  possible  shifts  of  the  template.  For  multi-dimensional, 
non-rigid-body  deformations  the  search  space  is  prohibitively  large,  precluding  an  ex¬ 
haustive  search  of  the  transform  space.  Instead,  a  gradient  search  method  becomes  more 
practical.  The  gradient  of  the  MI  between  the  template  and  study  is  used  to  compute  a 
body  force  bd(x)  which  acts  on  the  template  to  drive  it  into  alignment  with  the  study 
image  data. 

Since  d  does  not  change  throughout  the  deformation  process,  maximizing  MI  is  equiv¬ 
alent  to  minimizing  a  conditional  entropy: 

max  J(d;a)  =  max{i?(d)  +  H(  a)  —  H(  d,a)}  (4-14) 

=  max  {H(a)  —  H(d,  a)}  (4-15) 

=  minff(d|a)  (4.16) 

The  quantity  H( d,  a)  is  a  measure  of  the  uncertainty  introduced  by  the  imaging  process 
and  can  be  called  the  imaging  entropy. 
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The  body  force  is  computed  by  taking  the  localized  variation  of  H(d  \  a)  with  respect 
to  u.  Recall  that  a  is  a  function  of  u  since  a(x,  t)  =  a(x  -  u(x),  t). 


bd(x,u(x,t))  =  -Vu#(d  |  a) 

(4.17) 

=  EE  V«lPDA(<iA)iogPDA(d,a)) 

d  hi 

-  E  v»[/^(a)  tog  /^(a)] 

(4.18) 

a 


where  Vu  is  the  gradient  with  respect  to  the  displacement  field.  Letting  Xi  :  i  —  1, 2  be 
the  orthogonal  unit  vectors  specifying  the  2-D  Cartesian  space  of  the  images,  (4.18)  is 
rewritten  as 


i  /  Na-lNd-l  q 

bd(x,  u(x,  t))  =  £i  (  Qu.  loS  Pda(^  «)] 

i=l  \  d=0  d— 0  ' 


Na~  1 


-E 


d 


-[PA{a)  log  PA{a)] 


d 


to 

2  /  Na-1  Nd- 1 

i= 1  V  5=0  d=0  Xi^  ' 


(4.19) 


The  final  equality  follows  from  the  relationship 

d 


du&.{x) 


(PlogP)  =  (1  +  logP) 


d 


9u£<(x) 


(P) 


(4.20) 


Since  the  probability  distributions  are  discrete  rather  than  continuous,  the  partial 
derivatives  are  approximated  with  differences  based  on  the  four  locations  immediately 
surrounding  x.  Also  for  the  discrete  case,  g~j^j(p)  becomes  and  (4.19)  is  reduced 
to 


bd(x,u(x,t))  ft! 


J_A  /[-Ppj(^(x),a(x-^))  +  ^] 

Np  tl Xl  °S  V  [PDA(d(*)>  a(x  +  **))  +  2^p] 

[Pj(a(x  +  ^))  +  ^]\ 

Ipa( a(x  -  *<))  +  2^]/ 


(4.21) 
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The  factor  srr  is  also  a  function  of  the  discrete  difference  approximation  to  the  continuous 
derivative.  Equation  (4.20)  requires  the  value  of  P  at  the  point  the  derivative  is  taken. 
In  this  case  it  is  taken  to  be  halfway  between  the  two  values  used  in  the  difference. 

The  computational  complexity  of  calculating  for  all  x  using  (4.21)  is  linear  with 
respect  to  the  image  size.  Other  proposed  MI  gradient  methods  have  a  cost  which  is 
quadratic  with  respect  to  the  number  of  samples  used  to  calculate  the  MI  gradient  [30] . 

The  preceding  derivation  of  (4.21)  assumes  that  an  incremental  change  in  u  at  x 
only  changes  the  probability  mass  functions  for  graylevel  values  d  =  a(x),  a(x  ±  Xi)  and 
d  =  d(x).  This  assumption  is  not  strictly  true  when  the  Parzen  window  technique  is  used 
to  estimate  the  probability  mass  functions.  However,  (4.21)  is  still  a  good  approximation 
for  most  practical  window  functions  including  the  Gaussian  function.  The  dominant 
terms  of  (4.21)  will  be  those  for  which  changes  in  P(d ,  a)  and  P(a )  are  large  with  respect 
to  u.  Due  to  the  smoothing  nature  of  the  windowing  function,  the  contributing  value  of 
each  of  the  dominant  terms  will  be  approximately  equal  to  the  actual  value  calculated 
by  (4.21).  Thus  the  actual  gradient  will  be  approximately  related  to  the  value  calculated 
in  (4.21)  by  a  constant. 

To  test  the  utility  of  (4.21)  with  Parzen  windowing,  joint  and  marginal  probability 
mass  functions  were  estimated  for  a  template  and  cryosection  image  using  a  Gaussian 
window.  The  Gaussian  window  was  applied  to  smooth  only  the  graylevel  values  of  the 
input  image  and  not  those  of  the  template.  The  result  of  (4.21)  was  then  compared 
with  the  exact  MI  gradient  calculation.  Comparisons  were  done  for  Gaussian  window 
widths  of  a  =  5  and  a  —  10.  The  maximum  and  average  magnitude  differences  relative 
to  the  maximum  magnitude  of  the  body  force  are  shown  in  Table  4.1.  Values  which 
were  identical  using  (4.21)  and  the  exact  calculation  were  not  included  in  the  average 
calculation. 

In  order  for  the  gradient  method  to  be  feasible,  the  path  leading  to  the  global  mini¬ 
mum  (maximum)  must  be  smooth  so  the  search  will  not  be  attracted  to  a  local  extrema. 
To  examine  the  occurrence  of  local  extrema,  the  MI  between  two  images  was  measured 
as  one  was  scaled  smaller  and  larger.  Figure  4.11  shows  the  two  images  and  the  resulting 
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Table  4.1  Measurements  of  the  utility  of  (4.21)  when  Parzen  windowing  is  used.  Com¬ 
parison  is  made  between  the  MI  gradient  calculation  using  (4.21)  and  the  exact  calcula¬ 
tion.  Maximum  and  average  magnitude  differences  relative  to  the  maximum  body  force 
magnitude  are  shown.  _ _ _ 


var=100 

var=25 

b  Max  difference 

8.5% 

7.8% 

b  Ave  difference 

1.0% 

.5% 

Figure  4.11  Plot  of  the  MI  between  images  (a)  and  (b)  as  (b)  was  scaled  smaller  and 
larger. 

plot  of  the  MI  vs.  scale  value.  As  is  seen  by  the  graph,  the  resulting  plot  is  quite  smooth. 


To  examine  further  the  possibility  of  local  extrema,  a  test  was  performed  similar  to 
one  performed  by  Maes  et  al.  [32]  wherein  the  pixels  of  the  images  in  Figure  4.11  were 
randomly  mixed  so  that  the  marginal  intensity  distributions  and  self  entropies  remained 
unchanged  for  the  full-scale  images,  but  the  images  had  a  less  defined  structure.  Fig¬ 
ure  4.12  shows  the  modified  images  and  the  resulting  plot  of  the  MI  with  respect  to 
scaling.  The  global  maximum  occurs  at  the  same  point,  but  there  are  many  local  max¬ 
ima  and  the  region  of  attraction  is  much  smaller.  This  test  demonstrates  that  a  gradient 
search  method  will  work  better  on  well  structured  images. 


Figure  4.12  Plot  of  the  MI  between  images  (a)  and  (b)  as  (b)  was  scaled  smaller  and 
larger.  In  this  case,  images  do  not  have  a  well  defined  structure,  so  the  plot  is  not  very 
smooth,  and  the  region  of  attraction  for  the  global  maximum  is  small. 
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4.3  MI  Bias  and  Variance 


Because  an  image  contains  only  a  finite  number  of  pixels,  the  Parzen  probability  es¬ 
timate  will  have  nonzero  variance.  One  may  ask  how  this  affects  the  MI  calculation. 
It  turns  out  that  since  entropy  is  a  nonlinear  (in  fact  concave)  function  of  the  proba¬ 
bility  distribution,  the  entropy  estimate  is  biased.  Work  on  quantifying  both  the  bias 
and  variance  of  MI  and  entropy  estimates  from  a  finite  number  of  samples  has  been 
done  by  Moddemeijer  [60]  for  histogram-based  probability  estimates  of  continuous  data. 
Histogram-based  probability  estimation  is  essentially  equivalent  to  using  a  rectangular 
Parzen  window  with  unit  width.  This  work  is  summarized  below  with  slight  modification 
for  discrete  data,  followed  by  its  extension  to  nonrectangular  Parzen  windows. 

The  joint  and  marginal  histograms  of  two  images  a  and  d  are  defined  as 

Np  Np 

kd,a  =  ^  X]  5(d(m)  -  d,  a(n)  -  a)  (4.22) 

m= 1  n=l 
Np 

kd,-  =  ^2  ~  d )  (4-23) 

771=1 

Np 

ka  =  ^2,S(a(n)  -  a)  (4.24) 

71=1 

Let  uppercase  variables  Kd^,  Kd,.,  and  K.  d  be  random  variables  which  represent  the 
uncertainty  of  the  histogram  values  for  finite  Np.  The  distribution  of  Kd^  for  any  fixed 
a  and  d  follows  the  binomial  probability  law: 


E(Kdta)  =  NpP(d,  a) 


(4.25) 


and 


COV(Kdt&,K1}S)  =  { 


NpP(d,a)(l  —  P(d,  a))  if  a  =  d  and  7  =  5 
—NpP(d,  a)P(  7, 5)  otherwise 


(4.26) 
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where  P(d,  a)  is  the  “true”  joint  probability  of  random  variables  D  and  A.  The  joint 
entropy  and  MI  of  A  and  D  can  be  estimated  using 


(4.27) 

(4.28) 


4.3.1  Bias 

As  previously  stated,  the  estimators  in  (4.27)  and  (4.28)  are  biased.  In  the  case  of 
discrete  r.v.’s,  this  is  due  to  the  finite  sample  size  and  is  referred  to  as  JV-bias  in  [60]. 
The  expected  N- bias  can  be  calculated  exactly  since  the  distribution  of  Kd,d  is  known. 
For  joint  entropy  this  is 


N- bias  =  H{D,  A)  -  E(H(D,  A)) 


=  H(D,A) 


Np  , 

|  P(d,a))  ( 

A  n  Jr— n  ^ 


(4.29) 

(4.30) 


Substitution  of  the  binomial  distribution  leads  to 

Np 


IV-bias  =  H(D,  A) 


i  i-vvi 


N, 


d,a  k~0  ^  ^ 


P(d,d)fc(l-P(d,a))^ 


N„-k 


(  kd,a,Ep  kd,a,Np\ 

V  i°g  h,-k.,dj 


(4.31) 


The  A-bias  as  calculated  by  (4.31)  is  always  positive  due  to  the  concave  nature  of  —  p  logp; 
therefore,  the  entropy  estimate  is  expected  to  underestimate  the  actual. 


4.3.2  Variance 

Direct  calculation  of  the  variance  of  the  entropy  estimate  using  the  binomial  distri¬ 
bution  is  much  more  complicated.  Instead,  an  approximation  is  obtained  using  a  Taylor 
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expansion.  The  Taylor  expansion  of  (4.27)  is 


H( D ,  A)  =  Y,  I  "  P(d>  «)  loS  P (<*> «)  -  ( 

d,a  l  ' 


x(Kd,„-NpP(d,a))+0 


(4.32) 


and  the  variance  of  H(D,  A)  is 


VAR  (H(D,A))  =  E{H2(D,A))-E2(H{D,A)) 


(4.33) 


Substitution  of  (4.32)  into  (4.33)  gives 


VAR(tf(£>,  A))  =^E(tPyp^p^){wp  +  wp^ 

d,a 


x  E(Kd> a  -  NpP(d,  a))(Kcd  -  NpP1>s ) 


(E  (wp  +  w„  log  p(d’ s)) E(Kt*  _  N”p(d' s)))  +  ° 


(4.34) 


Further  substitution  of  (4.26)  gives 


VAR (H(D,A))  =^~  \Y,P(d,a)\og2P(d,a)-  (^P(d,a)logP(d,d) 


p  \  d.a 


(4.35) 


Similar  derivation  yields 


VAR  Ad, A))  4jE^5)'»8,^)-(Ef,«i)'^ 


^  \  <2, a 


P{d,d) 


(4.36) 


Equations  (4.35)  and  (4.36)  can  be  rewritten  as 


VAR(H(D,  A))  =  ^-VAR(log  P(d,  a ))  +  O  ( 4 


(4.37) 


50 


and 


VAR(/(D,  A))  =  ±VAR  (log  ^L)  +  O  (^)  (4.38) 

which  clearly  show  that  the  variance  is,  to  a  first  approximation,  a  function  of  only  the 
number  of  samples  and  the  underlying  distribution. 


4.3.3  Generalization  for  Parzen  windows 

The  N- bias  calculation  can  be  extended  to  predict  the  MI  and  entropy  bias  for  the 
case  that  a  more  general  Parzen  window  is  used  to  estimate  the  probability  distributions 
rather  than  the  Kronecker  delta  function  in  (4.22)-(4.24).  The  resulting  entropy  or  MI 
estimate  has  two  bias  components  rather  than  the  single  N- bias  component.  Although 
smaller,  the  iV-bias  remains.  The  other  is  based  on  the  fact  that  the  Parzen  window 
inherently  smoothes  the  underlying  distribution.  This  bias  component  will  be  called 
5-bias. 

Recall  from  Section  3.2  that  for  the  Parzen  window  technique,  the  probability  estimate 
for  a  specific  ( d ,  o)  approaches  a  normal  distribution  with  mean 

B(^)“P(d,a)  (4'39) 

and  variance 

VAR  «  P(d,d )  £  1^2)  (4-40) 

'  P  '  ^1,^2 

Equation  (4.40)  shows  that  the  variance  of  P[k  \  P(d,  a))  is  affected  by  the  energy  in  the 
Parzen  window.  Applying  a  correction  to  the  variance  due  to  the  nonunit  energy  of  the 
windowing  function  leads  to 

N 

N- bias  «  H(D,  i)  -  £  R>„  u2)  £  £  PA  W  o)‘(l  -  P(d, a))"'"4 

,1^2  d,a  k= 0  '  ' 

f  kd^aNp  k^a^p\ 

^  log  k~k~J 
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Although  the  estimate  is  still  expected  to  be  less  than  the  actual  entropy,  typical  Parzen 
windows  will  have  energy  less  than  unity  and  will  reduce  the  iV-bias.  For  example,  a  1-D 
Gaussian-shaped  Parzen  window  with  standard  deviation  5  will  have  an  energy  of  0.056, 
so  the  iV-bias  will  be  reduced  by  approximately  94%. 

The  second  bias  introduced  by  the  Parzen  window,  the  5-bias,  is  essentially  due  to 
the  nonzero  Parzen  window  width  required  for  a  finite  number  of  samples.  As  discussed 
in  Section  3.2,  the  Parzen  density  estimate  actually  returns  P(d,a )  *  R(d,a),  so  the 
Parzen-based  entropy  estimate  is 

H(P(d,  a))  =  H(P(d,  a)  *  R(d,  d))  (4.42) 

and  the  S- bias  is  computed  by 

5-bias  =  H(P(d,  d)  *  R{d,  d))  -  H(P(d,  d))  (4.43) 

Note  that  the  5-bias  is  not  a  function  of  Np,  only  of  P(d,  a)  and  R,  and  since 

H (P{d,  a))  <  H(P(d ,  a)  *  R(d,  a))  (4.44) 

the  5-bias  leads  to  an  overestimation  of  the  entropy  or  an  underestimation  of  the  MI. 

Inspection  of  (4.35)  reveals  that,  to  a  first-order  approximation,  the  variance  is  largely 
only  a  function  of  the  true  underlying  probability  and  the  number  of  samples  acquired. 
Thus,  the  Parzen  window  will  have  a  limited  effect  on  the  variance  of  the  estimation. 
Equation  (4.35)  can  be  rewritten  as 

VAR (H(D,  A))  =  (  £(P(d,  a)  .  R)  log2(P  W  «)  *  R ) 

P  \  d,a 

-  a)  * R)  loi(P[d' a)  * fl)) ) + 0  (j|)  <4-45> 

Thus  as  the  window  width  increases  the  variance  will  decrease. 

Figure  4.13  shows  actual  entropy  estimates  as  a  function  of  Np,  based  on  samples 
from  a  PMF,  as  well  as  the  expected  value  of  the  entropy  estimates,  based  on  the  N- 
bias  calculation.  The  AT-bias  is  calculated  according  to  (4.31).  The  solid,  smooth  curve 
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Figure  4.13  Plot  demonstrating  predictability  of  iV-bias.  The  solid  smooth  curve  shows 
the  expected  entropy  estimate,  corrected  for  iV-bias,  vs.  Np.  The  jagged  curve  is  the 
actual  experimental  entropy  estimate.  The  bias-corrected  calculation  closely  predicts  the 
experimental  estimate  . 

shows  the  expected  N- biased  estimate.  The  jagged  curve  shows  experimental  estimates 
according  to  (4.27)  for  different  values  of  Np.  The  experimental  results  closely  follow 
the  calculated  expected  result.  Figure  4.14  shows  results  of  a  similar  simulation.  In  this 
case,  the  upper  set  of  curves  are  actual  and  expected  entropy  estimates  when  a  Gaussian 
window  with  a  —  5  was  used  to  estimate  the  probabilities.  The  lower  set  of  curves  are 
those  from  Figure  4.13  for  comparison.  The  use  of  the  Gaussian  window  reduces  the  bias 
for  small  Np,  but  becomes  a  slight  over-estimate  due  to  the  dominance  of  the  S'-bias  for 
large  Np. 

The  variance  of  the  measured  entropy  estimates  without  and  with  the  Parzen  window 
technique  are  more  clearly  shown  in  Figures  4.15  and  4.16,  respectively.  The  figures  show 
the  difference  between  the  empirically  measured  entropy,  based  on  Np  samples,  and  the 
expected  entropy,  based  on  knowledge  of  the  PMF  and  Np.  The  upper  and  lower  smooth 
curves  represent  two  standard  deviations  from  the  expected  entropy  as  calculated  using 
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8.5 


Figure  4.14  Demonstration  of  the  predictability  of  iV-bias  and  5-bias  when  Parzen 
windowing  is  used.  Upper  set  of  curves  shows  the  predicted  and  actual  Parzen-based 
entropy  estimates  vs.  Np.  Lower  set  of  curves  are  those  of  Figure  4.13  for  reference. 

(4.35).  As  expected,  approximately  95%  of  the  samples  lie  within  the  2 a  interval.  The 
magnitudes  of  the  variations  are  quite  similar  with  and  without  Parzen  smoothing. 

As  a  final  note  on  the  bias  and  variance,  experiments  were  conducted  to  determine  the 
effect  of  a  Parzen  window  on  the  location  of  the  global  maximum  for  mutual  information. 
Figure  4.17  shows  the  MI  vs.  shift  plot,  similar  to  Figure  4.8,  of  the  1-D  images  from 
Figure  4.4  using  a  Gaussian  window  of  varying  widths.  Increasing  the  window  width 
increases  the  underestimation  of  the  MI  and  reduces  the  dynamic  range,  but  the  global 
minimum  does  not  change  position.  Figure  4.18  shows  the  curves  of  Figure  4.17  after 
scaling  them  to  a  dynamic  range  of  (0,1).  The  normalized  curves  are  nearly  identical  for 
a  wide  range  of  windows.  Similar  results  are  obtained  using  a  rectangular  window  with 
width  ranging  from  between  1  and  120  pixels. 
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Figure  4.15  Demonstration  of  the  predictability  of  entropy  estimation  variance.  The 
upper  and  lower  smooth  curves  represent  two  standard  deviations  from  the  mean  as 
calculated  by  (4.35). 


Figure  4.16  Demonstration  of  the  predictability  of  entropy  estimation  variance  when 
Parzen  windowing  is  used.  The  variance  is  quite  similar  in  magnitude  to  that  of  Figure 
4.15.  The  curved  lines  are  added  as  a  reference  for  comparison  with  the  variance  plot  of 
Figure  4.15. 


55 


Figure  4.17  Alignment  error  curves  using  MI  and  Gaussian-shaped  Parzen  windows 
of  varying  widths.  As  the  window  width  increases,  the  dynamic  range  of  the  curve  is 
reduced  and  the  bias  increases. 


Figure  4.18  Normalized  alignment  error  curves  using  MI  and  Gaussian-shaped  Parzen 
windows  of  varying  widths.  The  misalignment  curve  is  quite  similar  for  a  wide  range  of 
window  widths. 
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4.4  Bound  on  the  Maximal  Mutual  Information 

An  upper  bound  is  now  defined  for  the  MI  between  d  and  a.  To  derive  the  bound,  u 

must  be  limited  appropriately.  In  the  context  of  image  registration,  u  defines  a  coordinate 

transform,  so  a(x)  —  a(x  —  u)  maintains  the  same  number  of  graylevels  as  a.  Let  G(a) 

be  a  function  determining  the  number  of  graylevels  in  an  image,  and  define  the  following 
sets: 


and 


“u  =  {u  I  G( a(x  -  u)  <  (7(a) } 


Wa  =  {a  I  (7(a)  <  (7(a) } 


It  immediately  follows  that 


max  7 (d;  a)  =  max/(d;a)  4  B 

ueo>u  a6u,a  v  ’  > 


(4.46) 


(4.47) 


(4.48) 


and 


(4.49) 


7(d;  a)  <  B 

The  equality  of  (4.48)  is  obvious  noting  that  ua  =  (a  |  u  G  uju}. 

A  more  insightful  bound  for  7(d;  a)  is  now  derived. 

Theorem  4.1  Consider  two  images,  d  and  a,  such  that  G(a)  <  G(d).  Let  Cja  =  {a  | 
7f(a  |  d)  =  0,  a  G  cva};  then, 

B  =  (4.50) 

Proof:  First,  note  that  for  any  a  G  cua 

7(d;  a)  =  77(a)  —  77(a  |  d)  =  77(a) 

Also,  ua  is  not  empty  under  the  condition  (7(a)  <  G(d).  This  can  be  understood 
from  the  fact  that  there  always  exists  an  a  6  w„  such  that 

P{aj  |  di)  =0  or  P(aj  |  %)  =  1,  Vi,j  (4.51) 
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where  di  and  a*  denote  the  intensity  values  of  d  and  a,  respectively.  For  example, 
(4.51)  is  achieved  when  the  intensity  mapping  from  d  to  a  is  one-to-one  (in  the  case  of 
G{ a)  =  G(d))  or  many-to-one  (if  G( a)  <  G(d)). 

Next,  (4.50)  is  proven.  Note  that  (2)a  C  u>a,  so 

max  /(d;a)  >  max/(d;  a) 

Therefore,  it  suffices  to  show  that  for  any  a  E  ua,  there  exists  an  a  E  ua  such  that 

/(d;a)</(d;a)  (4.52) 

Clearly,  if  a  happens  to  be  in  u)a,  then  a  =  a  satisfies  (4.52).  Now,  assume  that 
a  Cj a]  that  is,  H(a  |  d)  7^  0.  Then  there  exists  dx  mapped  to  (at  least)  two  different 
values  di  and  a2.  Define  variables  Po,o  and  A  such  that  PDA{di,ax)  =  P0) 0  ~  A  and 
PDj4(di,a2)  =  A,  where  0  <  A  <  P0io-  Also  define  P0  =  Pa{q  —  ax  |  A  =  0)  and 
Pi  =  P^a  =  a2  |  A  =  0).  The  value  of  A  can  be  adjusted  by  selecting  different  images 
aA  £  00 a-  Based  on  these  definitions, 

/( d;  aA)  =H( d)  +  H( aA)  -  H{ d,  aA) 

=P(d)-(P0-A)log(P0-A) 

-  (Px  +  A)  log(Px  +  A)  -  P(°) log  P(°) 

+  (Po,0  —  A)  log(P0)0  —  A)  +  A  log  A 
+  53  E  Wa)logP(d.a) 

(d,a)/(di,ai) 

(d,a)^(d2,a2) 

Treating  /(d;aA)  as  a  function  of  A  leads  to 

d/(d;aA)  (Po-A)A 

dA  °S  (Pi  +  A)(P0,o  -  A) 

and 

d2I(d;  aA)  1  1  1  1 

dA2  ~PW~A  P0-  A  +  &  P,  +  A 


(4.53) 


The  first  term  of  (4.53)  is  always  larger  than  the  second  term,  and  the  third  term  is 
larger  than  the  forth,  so  /( d;a^)  is  convex  with  respect  to  A  (see  [48],  p.  31).  From 
this  convex  property,  it  follows  that  /( d;  aA)  is  maximal  when  A  =  Po,o  or  when  A  =  0. 
This  corresponds  to  either  P{a\  \  =  0  or  P(a2  \  d\)  =  0,  respectively.  If  the  resulting 

maximal  aA  belongs  to  ua,  then  setting  a  —  aA  achieves  the  inequality  of  (4.52).  If  aA  £■ 
Cja ,  then  following  the  same  analysis,  there  exist  di  and  a2  in  aA  such  that  P(di  |  di)  0 
and  P(o2  |  di)  ^  0  for  some  d\.  However,  a  new  a  a  €  ua  can  be  found  which  has  higher 
mutual  information  with  d,  and  P(5i  |  di)  =  0  or  P(a2  |  dj)  =  0.  Therefore,  a  necessary 
condition  on  any  template  a*  such  that  7(d;  a*)  achieves  its  maximal  value  is  that  there 
exist  no  Oi  and  a2  such  that  for  a  fixed  value  d\  both  P{a\  \  d\)  ^  0  and  P(o,2  \  d\)  ^  0. 
In  other  words,  a*  €  Cja.  This  proves  (4.52)  and  the  theorem. 

Remark  4.1: 

B  <  log(G(a))  (4.54) 

The  above  inequality  can  be  easily  proven  noting  that 

maxP(a)  <  maxF(a)  =  logG(a) 

a€cZ>a  aGcdo 

Remark  4.2: 

Equation  (4.50)  still  holds  if  the  set  ua  is  tightened  such  that 

=  {a  |  G( a)  =  G(a)} 

This  can  be  understood  from  the  following  facts: 

1.  If  G( a)  <  G(a)  and  a  G  <2)a,  a  corresponding  a'  can  be  found  such  that  G'(a')  =  G( a) 
and  a'  G  (2>a- 

2.  maxa£Uo  H( a)  increases  as  G(a)  increases. 

4.4.1  Discussion  of  the  bound 

It  is  clear  from  Theorem  4.1  that  the  MI  bound  can  be  calculated  directly  by  maximiz¬ 
ing  the  entropy  of  the  template  under  the  constraint  that  the  conditional  entropy  of  the 
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Figure  4.19  Synthetic  image  with  multiplicative  noise  and  several  possible  segmenta¬ 
tions,  including  one  that  achieves  the  bound,  (a)  Synthetic  image  with  multiplicative 
noise,  (b)  correct  manual  segmentation,  (c)  an  incorrect  segmentation,  (d)  another  in¬ 
correct  segmentation  which  achieves  the  bound  set  by  (4.50). 


template,  given  the  fixed  image,  be  zero.  If  all  possible  transformations  of  a  template  are 
allowed,  the  bound  is  obtainable,  and  the  bound-achieving  template  is  not  unique.  This 
implies  that  if  a  transform  is  too  “powerful”  or  “flexible,”  maximizing  the  MI  may  result 
in  an  undesirable  deformation.  To  illustrate  this  point,  consider  the  images  in  Figure 
4.19.  Figure  4.19(a)  shows  an  image  which  has  been  corrupted  by  multiplicative  noise 
such  that  it  contains  21  distinct  intensities.  Figures  4.19(b)-(d)  represent  candidate  tem¬ 
plate  images  which  all  contain  four  distinct  symbols.  The  bound  computed  from  (4.50) 
for  Figure  4.19(a)  and  any  template  containing  four  symbols  is  0.963.  The  MI  between 
Figures  4.19(a)  and  4.19(b)  is  0.658,  between  Figures  4.19(a)  and  4.19(c)  the  MI  is  0.110, 
and  Figures  4.19(a)  and  4.19(d)  achieve  the  bound. 

A  similar  study  was  conducted  with  a  cryosection  image  of  a  human  head.  Figure 
4.20(a)  is  the  head  image  and  Figures  4.20(b)  and  (c)  are  templates  each  containing 
five  distinct  symbols.  The  MI  between  Figures  4.20(a)  and  4.20(b)  is  1.210  and  Figures 
4.20(a)  and  4.20(c)  achieve  the  bound  of  1.674. 

Although  human  observers  consider  Figures  4.19(b)  and  4.20(b)  to  be  most  closely 
aligned  with  Figures  4.19(a)  and  4.20(a),  Pairs  4.19(a)  and  4.19(d)  and  4.20(a)  and 
4.20(c)  have  the  highest  MI.  Figure  4.21  shows  more  explicitly  the  lack  of  structure  in 
Figure  4.20(c).  To  avoid  this  type  of  problem  in  deformable  image  registration,  structural 
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(a)  (b)  (c) 

Figure  4.20  Real  cryosection  image  and  several  possible  segmentations,  including  one 
that  achieves  the  bound,  (a)  Cryosection  image  of  a  human  brain,  (b)  correct  manual 
segmentation,  (c)  another  segmentation  which  achieves  the  bound  set  by  (4.50). 


(f)  (g)  (b)  (i)  (j) 

Figure  4.21  Bound  achieving  template  separated  by  graylevel.  Top  row  shows  individual 
graylevels  of  Figure  4.20(b).  Bottom  row  shows  individual  graylevels  of  Figure  4.20(d). 
The  bottom  row  clearly  lacks  the  structure  of  the  top  row.  The  graylevel  indicating 
background  is  the  only  one  which  maintains  a  correct  structure. 
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Figure  4.22  Relationship  between  log(G(a)),  the  derived  bound,  and  J(d;a),  where  u 
is  a  scaling  transform.  The  curves  are  based  on  the  data  in  Fig.  4.20. 


and  topological  constraints  should  be  used,  as  is  the  case  with  elastic  or  fluid  models 
[34,  57] 

Figure  4.22  illustrates  the  relationship  between  the  bounds  in  (4.50)  and  (4.54)  and 
/(d;a)  based  on  the  data  in  Figure  4.20.  Here  we  assume  u  is  a  simple  scaling  transform. 
As  can  be  seen,  the  derived  bound  is  much  tighter  than  log(G'(a)).  An  even  tighter  bound 
could  also  be  derived,  but  it  would  be  dependent  on  the  characteristics  of  the  transform 
and  the  image  data. 


4.5  Optimality  of  Mutual  Information 

From  the  discussion  of  Section  4.4.1  it  is  clear  that,  in  some  cases,  the  template 
which  maximizes  I{D\  A)  will  not  correspond  to  the  correct  structure.  It  is  instructive  to 
determine  the  circumstances  under  which  the  maximal  MI  will  give  meaningful  results. 
It  can  be  proven  that  if  the  imaging  function  C  is  one-to-one,  then  MI  is  optimal  in 
the  sense  that  the  correct  segmentation  is  expected  to  maximize  the  MI.  Recall  from 
(4.16)  that  maximizing  I(D\A)  is  equivalent  to  minimizing  H(D  |  A),  and  let  /(•)  be  a 
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one-to-one  mapping;  then, 


ff(/(s) +  1)  |  S)  =  H(-n) 

(4.55) 

=  fffo|a) 

(4.56) 

<  +  t;  |  a),  Va 

(4.57) 

In  other  words,  the  MI  between  the  correct  segmentation  and  the  study  image  is  expected 
to  be  at  least  as  large  as  the  MI  for  any  other  segmentation  choice.  If,  however,  the 
graylevel  mapping  is  not  one-to-one,  the  equality  of  (4.55)  may  not  hold,  so  s  may  not 
be  the  maximal  MI  choice. 

It  should  also  be  noted  that  (4.55)  and  (4.57)  assume  white  noise.  In  reality  any 
particular  instance  of  the  white  noise  process  will  be  correlated.  In  fact,  it  is  the  goal  of 
the  maximal  MI  deformation  to  correlate  the  template  to  the  image  as  much  as  possible. 
So,  although  the  expected  maximizing  deformed  template  is  s,  the  deviation  from  the 
expected  template  for  any  particular  study  image  will  depend  on  both  the  noise  and  the 
structure  of  the  image. 

There  is  ambiguity  associated  with  the  equality  portion  of  (4.57).  It  is  manifest  in 
the  fact  that  symmetries  are  indistinguishable.  The  three  images  of  Figure  4.23  are 
considered  equivalent  using  the  MI  metric.  This  may  not  pose  a  serious  threat,  since  the 
segmentations  boundaries  are  not  changed. 


Figure  4.23  Symmetry  problem  example.  The  three  images  are  equivalent  in  an  infor¬ 
mation  sense. 

Figure  4.24  illustrates  one  way  a  non-one-to-one  mapping  can  lead  to  problems.  Figure 
4.24(a)  has  a  structure  most  similar  to  Figure  4.24(b),  but  because  Figure  4.24(b)  uses 
the  black  intensity  for  two  different  structures,  the  MI  between  (a)  and  (c)  is  greater 
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Figure  4.24  Many-to-one  graylevel  mapping  problem  example.  Visually,  (a)  and  (b) 
have  equivalent  structure,  but  the  mutual  information  is  highest  for  the  pair  (a)  and  (c). 


than  (a)  and  (b).  This  problem  can  be  partially  avoided  by  selecting  a  template  such 
that  each  structure  uses  a  different  graylevel. 

4.5.1  Maximum  likelihood  optimality 

The  fundamental  assumption  governing  the  use  of  mutual  information  is  that  the 
mutual  information  between  the  study  image  d  and  the  deformed  template  a  is  maximal 
when  the  template  is  equivalent  to  the  underlying  structure  s  of  the  image.  Under  certain 
conditions,  this  is  equivalent  to  the  maximum  likelihood  decision.  In  a  probabilistic  sense, 
an  optimal  template  can  be  defined  by 

s  =  argmaxP(a  |  d)  (4.58) 

a 

Using  the  definition  of  information  (3.53)  and  the  monotonicity  of  the  log  function,  (4.58) 
becomes 


s  =  arg  min /(a  |  d)  (4.59) 

a 

=  argminl(d  |  a)  +  1(a)  (4.60) 

a 

where  the  second  equality  follows  from  the  information  theory  equivalent  of  Bayes’  rule. 
In  other  words,  when  determining  which  deformed  template  is  best  aligned  with  the 
underlying  structure  of  an  image,  select  the  deformed  template  which  is  least  “surprising” 
based  on  the  observed  data. 

An  information-theoretic  approach  is  used  for  three  reasons: 
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•  It  provides  a  theoretical  basis  through  which  optimization  can  be  achieved  indepen¬ 
dent  of  the  amount  of  prior  knowledge  about  the  imaging  system.  In  the  absence 
of  information  about  the  system,  a  solution  can  be  found  which  minimizes  the 
complexity  of  the  system. 

•  The  problem  can  be  related  to  concepts  in  information  theory  —  such  as  informa¬ 
tion,  mutual  information,  and  entropy  —  which  aid  in  the  understanding  of  the 
problem. 

•  Mathematical  identities  and  theorems  are  in  place  which  facilitate  analysis. 

The  second  term  on  the  right-hand  side  of  4.60  is  the  prior  information  about  the 
template.  Minimization  of  this  term  will  be  discussed  in  Chapters  5  and  6;  the  main 
focus  here  is  on  the  first  term. 

A  major  problem  with  applying  (4.60)  in  its  current  form  is  that  P(d|a)  is  difficult 
to  estimate  since  we  only  have  one  example  of  d.  However,  assuming  elements  of  the 
random  vector  (d|a)  are  independent  and  identically  distributed  (equivalently,  the  imag¬ 
ing  process  is  memoryless  and  stationary),  the  first  term  of  (4.60)  can  be  rewritten  in  a 
more  useful  form.  Prom  the  identical  distribution  assumption  define  variables  d  and  a 
such  that 


/(d(i)  |  a(*))  =  I(d(j)  |  SO'))  =  I(d  \  a),  Vi,j  (4.61) 

Then  (neglecting  for  a  moment  the  prior  information) , 

s  =  min  J(d  |  a)  (4-62) 

a 

=  min  [—  log  P(d  |  a)]  (4.63) 

a 

Np 

=  min^  —  logP(d(i)  |  a(i))  (4.64) 

a  ^ J 

i= 1 

Na-l  Nd-1 

=  min- ^2  ^  PDA(d,a)\ogPDlA(d\a)  (4.65) 

a— 0  d=0 

=  mmH(D\A)  (4.66) 

a 

=  ma xI(D]A)  (4-67) 

a 
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Line  (4.63)  follows  from  the  definition  of  information  in  (3.53),  and  (4.64)  is  a  result  of 
the  independence  assumption.  Line  (4.65)  follows  from  (4.61)  and  the  fact  that  P D^(d,  a) 
is  directly  proportional  to  the  number  of  times  graylevels  d  and  a,  occur  concurrently  in 
the  image  and  template,  respectively.  Thus,  the  summation  over  all  pixel  locations  is 
converted  to  a  summation  over  all  graylevels  in  both  the  image  and  template. 

One  conclusion  of  (4.62)-(4.67)  is  that  maximization  of  mutual  information  is  optimal 
in  a  Bayesian  sense  under  certain  conditions.  The  conditions  are  independent,  identi¬ 
cally  distributed  elements  of  (d  |  a)  and  equally  likely  a.  As  previously  mentioned  in 
Section  2.4,  several  information-based  metrics  have  been  proposed  as  alignment  metrics. 
Studies  have  shown  that  each  of  them  may  perform  better  than  others  under  specific 
circumstances;  however,  MI  has  been  most  widely  accepted  for  general  application.  This 
is  consistent  with  the  probabilistic  optimality  of  MI  for  unknown  contrast. 

A  second  conclusion  is  that  the  problem  of  finding  the  least  surprising  deformed 
template  considering  only  the  image  data  is  equivalent  to  minimizing  H(D  \  A),  which 
is  the  uncertainty  introduced  by  the  imaging  system;  this  is  equivalent  to  finding  the 
simplest  imaging  process.  For  example,  a  very  simple  channel  (simple  in  the  Kolmogorov 
sense,  but  maybe  hard  to  construct!)  may  just  produce  s  as  the  output,  but  a  very 
complex  imaging  channel  might  make  a  battleship  look  more  like  a  butterfly.  One  may 
not  assume,  however,  that  the  imaging  process  used  in  any  instance  is  a  minimum  entropy 
process.  The  result  of  this  assumption  may  lead  to  images  similar  to  those  of  Figures 
4.19(d)  and  4.20(d). 

It  is  notable  that,  in  going  from  (4.64)  to  (4.65),  the  minimization  changes  from 
dependence  on  Np  to  dependence  on  Na  and  N^.  Thus,  the  MI  calculation  is  a  constant 
with  respect  to  the  image  size. 

4.6  Summary 

This  chapter  has  investigated  use  of  mutual  information  as  an  alignment  metric  for 
non-rigid-body  template  matching.  It  was  shown  that  MI  has  the  desirable  characteristic 
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that  knowledge  of  the  imaging  function  is  not  required.  Thus  it  is  useful  for  images  from 
a  wide  range  of  imaging  methods.  An  in-depth  study  was  conducted  on  the  bias  and 
variance  of  the  empirical  MI,  as  well  as  the  use  of  Parzen  windows  to  estimate  the 
probabilities  associated  with  the  images.  It  was  argued  theoretically  that  Parzen  window 
estimation  may  reduce  sensitivity  of  a  deformation  to  noise  and  partial  volume  effects. 

A  gradient  calculation  was  presented  for  use  in  the  deformation  process.  The  calcu¬ 
lation  scales  linearly  with  the  image  size,  so  it  is  readily  scalable  to  large  images.  Utility 
of  the  gradient  descent  method  was  found  to  be  dependent  on  the  image  and  template 
structures.  The  derived  gradient  method  is  valid  for  both  normalized  histogram-based 
MI  and  Parzen  window-based  MI. 

A  bound  was  also  found  on  the  maximal  MI  between  a  given  image  and  a  template. 
The  bound  provided  useful  insight  into  the  use  of  MI  as  an  alignment  metric  for  template 
deformation. 

Finally,  the  conditions  for  optimality  were  explained  under  two  different  optimality 
criteria.  The  conditions  for  Bayesian  optimality  are  as  follows: 

•  Elements  of  the  random  vector  d  are  independent  given  a. 

•  Elements  of  (d|a)  are  identically  distributed. 

•  The  possible  a  have  equally  likely  probability. 

A  sufficient  condition  to  ensure  that  the  true  underlying  structure  is  the  maximal  MI 
structure  is  that  the  imaging  process  be  one-to-one.  Problems  related  to  non-one-to-one 
imaging  processes  can  be  mitigated  by  reducing  the  number  of  graylevels  in  the  template 
and  using  a  separate  graylevel  for  each  structure. 
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CHAPTER  5 


A  FLUID-BASED  TRANSFORM  MODEL 


This  chapter  addresses  the  question  of  how  to  determine  a  displacement  field  so  as  to 
maximize  the  similarity  metric  while  still  maintaining  a  meaningful  result.  To  accomplish 
this,  a  transformation  model  C  is  defined,  which  models  all  the  feasible  transformations. 
The  displacement  field  is  related  to  the  body  force  (and  thus  the  study  image)  through 

£u  +  b  =  0  (5.1) 

where  b  is  the  body  force  and  u  is  the  template  displacement.  For  image  segmentation,  it 
is  desirable  that  the  deformation  preserve  the  template  topology  and  accommodate  large, 
curved  displacements.  It  has  been  shown  by  Christensen  et  al.  [25,  61]  that  deformations 
based  on  a  non-mass-conserving  viscous  fluid  model  preserve  these  features.  This  chapter 
first  reviews  this  fluid  model,  then  proposes  a  modified  model  that  preserves  the  desirable 
properties  of  the  fluid  model,  but  requires  less  computation  and  may  lead  to  more  intuitive 
incremental  deformations. 


5.1  Terminology  and  Relevant  Theorems 

5.1.1  Topology  conservation 

Two  images  are  considered  topologically  equivalent  if  and  only  if  the  two  images  can 
be  related  through  a  homeomorphism.  A  homeomorphism  is  a  continuous,  one-to-one, 
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Figure  5.1  Homeomorphic  vs.  nonhomeomorphic.  (a)  and  (b)  axe  homeomorphic,  but 
(c)  is  not  homeomorphic  with  either  (a)  or  (b).  The  single  circle  of  (a)  or  (b)  would  have 
to  be  torn  to  deform  to  (c) . 

onto  mapping  between  two  geometric  figures,  which  also  has  a  continuous  inverse.  Home¬ 
omorphic  mappings  include  operations  such  as  bending  and  stretching;  tearing,  however, 
is  not  allowed.  A  sphere,  for  example,  is  topologically  equivalent  to  an  ellipsoid  because 
they  are  related  through  stretching.  A  single  sphere  is  not  equivalent  to  two  spheres 
since  the  single  sphere  would  have  to  be  torn  to  map  to  two  spheres.  This  is  illustrated 
in  Figure  5.1.  In  the  case  of  a  deformation  of  Figure  5.1(a)  to  5.1(c),  the  point  at  which 
white  space  appears  between  the  two  circles  in  (c)  marks  a  discontinuity  in  the  mapping 
between  the  images  since  it  is  ambiguous  whether  the  particle  comes  from  above  or  below 
the  initial  sphere. 

5.1.2  Large,  nonlinear  deformations 

Deformable  models  such  as  linear  elastic  models  assume  that  deformations  are  small 
enough  that  the  nonlinear  terms  present  in  the  actual  physical  elastic  model  may  be 
neglected.  Deformations  of  this  type  follow  straight-line  trajectories,  and  are  penalized 
in  proportion  to  the  distance  of  the  deformation. 

On  the  other  hand,  large  deformation  models,  such  as  the  fluid  model  used  in  this 
thesis,  do  not  expressly  restrict  the  magnitude  of  the  displacements.  Instead,  only  the 
rate  of  displacement  is  regulated.  The  model  is  also  nonlinear  because  the  displacements 
are  allowed  to  follow  curved  trajectories. 
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5.1.3  Eulerian  formulation 


Deformations  are  represented  using  an  Eulerian  formulation.  Within  continuum  me¬ 
chanics,  the  coordinate  system  used  to  formulate  problems  is  based  on  either  an  Eulerian 
or  Lagrangian  formulation.  Eulerian  formulations  define  a  fixed  coordinate  system,  x,  y , 
2,  and  t.  The  fluid  at  a  specific  location  in  the  coordinate  system  will  consist  of  differ¬ 
ent  actual  fluid  at  different  times,  based  on  the  flow  of  the  fluid.  In  contrast,  under  a 
Lagrangian  formulation  the  coordinate  system  moves  with  a  particular  volume  of  fluid 
being  tracked.  In  this  thesis,  the  Eulerian  formulation  is  used.  The  Eulerian  formulation 
usually  leads  to  smaller  errors  from  interpolation  effects.  The  Eulerian  formulation  is  also 
more  convenient  when  comparing  the  template  to  a  study  image  and  for  display  purposes. 
As  the  template  deforms,  moving  (Lagrangian)  coordinates  may  not  correspond  to  the 
fixed  (Eulerian)  coordinates  used  in  the  study  image.  Thus,  an  extra  interpolation  step 
would  be  required  to  convert  the  relocated  points  to  the  values  on  the  discrete  lattice. 

Figure  5.2  illustrates  the  relationship  between  the  deformed  template,  the  original 
template,  and  the  displacement  field.  The  displacement  vector  field  at  any  point  x  and 
time  t  points  from  the  original  particle’s  location  to  its  current  location.  Thus,  the 
particle  at  x  at  time  t  originated  at  location  x  —  u(x)  in  the  original  template  at  time 
t  =  0. 

5.1.4  Pertinent  theorems 

As  the  transform  described  in  this  chapter  is  a  homeomorphism,  it  is  instructive  to 
present  some  definitions  and  theorems  by  which  this  property  can  be  proven.  Proofs  of 
the  theorems  have  been  omitted. 

A  transformation  is  said  to  be  continuous  if  it  satisfies  the  following  definition. 

Definition  5.1  ([62],  p.  332)  A  transformation  h  defined  on  a  set  D  C  Rn  is  said  to 
be  continuous  at  point  po  E  D  if  and  only  if  for  any  e  >  0  there  is  a  5  >  0  such  that 
|h(p)  —  h(p0)|  <  e  whenever  |p  —  Po|  <  <5  and  p  £  D. 
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Fixed  Reference 


Figure  5.2  Illustration  of  the  relationship  between  a  point  fixed  point  in  the  deformed 
template  and  points  in  the  original  template  as  time  varies.  (Adapted  from  [25].) 

Furthermore,  a  transformation  that  is  continuous  for  every  point  in  D  is  said  to  be 
continuous  on  D. 

The  next  definition  and  theorem  state  that  for  continuous  transforms,  structures  that 
are  adjacent  to  each  other  before  transformation  are  still  adjacent  after  transformation 
even  though  they  may  have  changed  shape. 

Definition  5.2  ([63])  A  connected  set  is  a  set  which  cannot  be  partitioned  into  two 
nonempty  subsets  which  are  open  in  the  relative  topology  induced  on  the  set.  Equivalently, 
a  connected  set  cannot  be  partitioned  into  two  nonempty  subsets  such  that  each  subset 
has  no  points  in  common  with  the  closure  of  the  other. 

Theorem  5.1  ([62],  p.  333)  Let  h  be  continuous  on  a  set  D.  Then  any  connected  set 
S  C  D  is  carried  into  a  connected  set  h(6'). 

Also,  any  concatenation  of  continuous  transforms  is  a  continuous  transform,  as  is 
stated  by  the  next  theorem. 
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Theorem  5.2  ([62],  p.  333)  Let  the  transformation  g  be  continuous  on  a  set  A  and  h 
be  continuous  on  a  set  B,  and  let  p0  €  A  and  g(p0)  G  B.  The  product  transformation 
h(g(p))  is  continuous  at  po- 


5.2  Fluid  Model 


In  the  viscous  fluid  model  presented  by  Christensen  et  al.  [25,  61],  a  transform  h(x,  t) 
is  computed  by 

h(x,£)  =  x  -  u(x,t)  (5.2) 

where  x  takes  on  values  in  X.  The  notion  of  a  velocity  field  v(x,  t)  is  introduced  which 
is  related  to  the  displacement  field  through  the  full  derivative  of  u  by 


v(x,t)  =  ^U0M) 


(5.3) 


The  solution  to  this  ordinary  differential  equation  (ODE)  is 

u(x,  t)  =  u(x,  0)  +  f  v(x,  r)dr  (5.4) 

Jo 

with  conditions  on  the  boundary  <9X  of  X  that  v,  and  hence  u  and  h,  equal  zero. 

To  remain  within  the  Eulerian  framework,  the  full  or  material  derivative  is  related  to 
the  partial  derivative  of  u  through  a  material  derivative  expansion 


rv 

v(x,t)  =  — u(x,i)  +  V[u(x,  t)fv(x,  f) 


(5.5) 


The  velocity  is  chosen  to  mimic  that  of  a  viscous  fluid  in  response  to  a  force  proportional 
to  the  steepest  ascent  direction  of  a  similarity  metric: 

pi V2v  +  //3VV  •  v  —  ^2v  +  b  =  0  (5.6) 


along  with  boundary  conditions  v  \qx=  0.  Equations  (5.5)  and  (5.6)  comprise  the 
fluid  model,  and  henceforth  (5.6)  will  be  referred  to  as  the  “original”  partial  differen¬ 
tial  equation  (PDE)  for  the  viscous  fluid  model.  The  model  is  C  =  £2  0  A,  where 
£2  =  px V2  +  ju3VV  •  —  p2,  Ci  =  +  v  •  V,  and  o  denotes  the  composition  of  £2  and 

Ci. 
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5.3  Discrete  Approximation  to  the  Fluid  Model 


The  displacement  field  of  (5.3),  when  approximated  in  the  time  domain  by  finite- 
differences,  is  quite  easily  translated  into  an  explicit  formula  for  the  displacement.  Based 
on  knowledge  of  the  velocity  and  displacement  at  the  previous  time  tk,  the  displacement 
at  time  tk+1  is 

u(x,£fe+1)  fa  u(x,tk)  +  A[v(x,tfe) 

— V[u(x,  tfc)]Tv(x,  £fc)]  (5.7) 

where  A  is  a  small  time  increment. 

This  move  to  a  discrete  approximation  to  the  solution  of  (5.3)  may  lead  to  singular¬ 
ities  in  the  transform.  In  order  to  guarantee  the  preservation  of  topology,  the  discrete 
Jacobian  of  the  transformation  is  determined  at  each  time  step.  The  template  is  regridded 
[25]  if  it  is  found  that  the  transformation  is  nearly  singular  at  any  point  in  the  domain. 
In  the  current  implementation,  a  new  template  is  propagated  when  the  Jacobian  at  any 
point  in  the  transform  drops  below  a  value  of  0.5.  A  larger  value  causes  more  templates  to 
be  propagated,  which  will  increase  numerical  precision  errors  of  the  transformation  due 
to  the  increased  number  of  concatenated  transformations;  conversely,  a  smaller  propaga¬ 
tion  threshold  causes  less  templates  to  be  propagated.  This  may  increase  the  numerical 
precision  errors  due  to  interpolating  transformations  that  are  nearly  singular  in  places. 

The  algorithm  for  a  discrete  template  deformation  using  the  fluid  model  is  as  follows: 

1.  Let  t=0,  i=0,  a^(x)  =  a(x),  and  u^(x)  =  0. 

2.  Calculate  the  body  force  b(x,  t)  as  in  Section  4.2. 

3.  If  b  is  below  a  threshold  for  all  x  G  X,  or  if  the  maximum  number  of  iterations  is 
reached,  then  STOP.  Otherwise,  proceed  to  step  4. 

4.  Solve  the  PDE  of  (5.6)  to  find  v. 

5.  Calculate  the  incremental  perturbation  of  the  displacement  field  by  solving  for  ^ 
in  (5.5)  based  on  r  =  v(x,  t)  —  V[u(x,tfe)]Tv(x,  tk). 
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6.  Choose  a  real-time  step  size  A  which  is  a  function  of  maxx  |v(x)|. 

7.  If  the  Jacobian  of  x  —  u  —  Ar(x)  is  less  than  0.5,  propagate  to  template  i  + 1  using 
ah+1)(X)t)  =  a^(x  —  uW(x,i)),  u(*+1)(x,  t)  =  0,  set  i  =  *  +  1,  and  go  to  step  5. 
Otherwise,  update  the  ith  displacement  field  using  u^(x,  t+ A)  =  u^(x,  f)+Ar(x), 
set  t  =  t  +  A,  and  go  to  step  2. 

5.4  Modified  Fluid  Model 

Two  modifications  are  made  to  the  fluid  model.  The  first  one  can  be  thought  of  as 
a  judicious  choice  of  /j,3  which  leads  to  a  more  intuitive  deformation  in  many  cases  and 
also  a  significant  reduction  in  the  computational  complexity.  Second,  the  complexity  is 
further  reduced  by  careful  selection  of  boundary  conditions. 

Insight  about  the  fluid  model  is  gained  by  realizing  that  (5.6),  using  discrete  time 
steps,  corresponds  to  rendering  stationary  the  functional  [64] 

F(v)  =  i 

+//2|v|2  —  2v  •  b]dx3  (5.8) 

The  first  term  penalizes  changes  in  the  velocity  field,  the  second  term  penalizes  diverging 
velocities,  and  the  third  term  penalizes  large  velocities.  The  final  term  penalizes  mis¬ 
alignment  between  study  and  template;  for  the  purposes  of  this  thesis,  b  is  a  function  of 
the  MI  between  the  images  as  explained  in  Section  4.2. 

Regularization  of  the  divergence  of  the  velocity  field  in  (5.8)  may  lead  to  deformations 
which  are  orthogonal  to  the  applied  force.  An  example  of  the  orthogonal  displacements 
is  shown  in  Figure  5.3.  A  force  as  applied  in  Figure  5.3(a)  may  produce  a  deformation 
as  shown  in  Figure  5.3(b).  For  different  values  of  Hi,  /i2,  and  /x3,  however,  the  same 
force  could  produce  a  different  deformation  as  shown  in  Figure  5.3(c).  Finally,  for  some 
shapes,  this  orthogonal  coupling  is  undesirable  all  together.  For  example  suppose  it  is 
desired  to  deform  the  square  of  Figure  5.4(a)  to  a  rectangle  shape.  Forces  applied  as 
shown  in  Figure  5.4(a)  would  result  in  a  deformation  shown  in  Figure  5.4(b). 


f [,ii|Vv|! 


+  ^3  V  •  V 
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Figure  5.3  Demonstration  of  orthogonally  induced  displacements.  Applying  a  force  in 
one  direction,  (a),  may  introduce  orthogonal  displacements  as  in  (b)  or  (c). 


b(x) 


(a) 


u(x) 

11 


(b) 


Figure  5.4  A  second  demonstration  of  orthogonally  induced  displacements.  Applying  a 
force  in  one  direction,  (a),  may  introduce  orthogonal  displacements  as  in  (b). 


The  amount  of  orthogonal  coupling  is  regulated  by  /i3.  Interestingly,  although  the 
incremental  deformations  may  be  noticeably  different  for  different  values  of  /i3,  experience 
has  shown  the  final  deformed  result  to  be  quite  insensitive  to  a  wide  range  of  Letting 
/r3  =  0,  a  much  simpler  PDE  emerges: 

Hi  V2v  -  /x2v  +  b  =  0  (5.9) 

which  is  now  actually  a  set  of  three  uncoupled  scalar  PDEs.  The  numerical  solution  of 
(5.9)  is  generally  much  faster  than  that  of  (5.6)  because  the  number  of  simultaneous 
unknowns  being  found  is  reduced  by  a  factor  of  three.  Equation  (5.9)  is  referred  to  as 
the  “modified”  PDE. 

In  order  to  further  render  (5.9)  amenable  to  numerical  analysis,  consider  the  extension 
of  the  image  and  template  from  the  unit  cube  to  all  of  R3 .  Additionally,  consider  the 
boundary  condition  on  the  velocity  field  lim^oo  |v(x)|  =  0.  In  this  case,  the  response 
of  the  velocity  field  to  a  force  is  translationally  invariant ,  meaning  that  the  response  is 
a  function  of  the  distance  between  the  applied  force  and  response  only,  not  the  absolute 
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location.  In  this  case,  the  velocity  field  may  be  explicitly  expressed  as 


v(x)  =  f  G(x  -  x')  •  b(x')dx'  (5.10) 

Jx 

where  the  dyadic  Green’s  function  (impulse  response)  is  given  by 

G(x  -  x')  =  -I— —  exp  (~x[^\x  -  x'|  J  log  |x  -  x'|  (5.11) 

2*1*1  \  y  fJ*i  J 

In  the  above,  I  is  the  2-D  identity  tensor  of  rank  2.  For  three-dimensional  data,  the 
dyadic  Green’s  function  would  be 


G(x  -  x') 


t  i  exp(-y/gix-x'i) 

27T^i  I  x  -  x' 


(5.12) 


where  I  is  the  3-D  identity  tensor  of  rank  2. 

Because  (5.11)  is  nothing  more  than  a  convolution,  it  may  be  evaluated  in  the  spectral 
domain  as  a  contraction: 


v(x)  =  T~'  p-(G(x))  •  ^(b(x)))  (5.13) 

where  JF(-)  is  the  spatial  Fourier  transformation  operator.  Finding  the  velocity  field  by 
evaluating  the  Fourier  integrals  in  (5.13)  with  fast  transforms  has  important  advantages 
over  directly  solving  the  PDEs  in  (5.6)  and  (5.9).  Because  of  the  rather  large  number  of 
unknowns  associated  with  the  solution  to  these  PDEs  for  image  resolutions  of  practical 
interest,  it  is  necessary  to  resort  to  iterative  solution  methods.  The  most  popular  of  these 
is  perhaps  successive  over-relaxation  (SOR)  as  used  in  [25],  which  requires  careful  selec¬ 
tion  of  a  relaxation  parameter  for  efficient  performance.  The  spectral  approach  embodied 
in  (5.13)  is  parameter-free,  requires  no  tuning,  and  has  computational  complexity  which 
scales  more  favorably  with  image  size  than  that  of  the  SOR  [65]. 

The  resulting  transform  model  produces  transformations  which  are  homeomorphisms. 
The  proof  is  similar  in  nature  to  a  proof  for  diffeomorphisms  in  [47].  The  steps  are  as 
follows:  (1)  Show  that  v(x,  t)  is  continuous  on  X.  (2)  Show  that  h(x,  t)  is  continuous  on 
X  and  t.  (3)  Show  that  h  is  one-to-one  and  onto. 
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Proof  that  v  is  continuous  is  done  by  contradiction.  Suppose  that  v  is  not  continuous. 
Then,  from  the  PDE  of  (5.9)  b  would  be  unbounded,  since  the  derivatives  of  noncontin- 
uous  functions  are  not  continuous  and  unbounded.  But  from  the  definition  of  the  body 
force  in  Section  4.2,  the  body  force  is  bounded,  so  v  must  be  continuous  on  X. 

Based  on  (5.2)  and  (5.4),  h  can  be  written  as 


h(x,  t)  =  x  —  x(x,  0) 


(5.14) 


It  is  straightforward  that  h  is  continuous  on  x  since  it  is  a  summation  of  functions  that 
are  continuous  on  x.  From  Definition  5.1,  h  is  continuous  on  t  since  v  is  bounded,  say, 
by  Q. 


I  h(x,t2)  —  h(x,ti)  |  =|  f  v(x,r)dr| 

(5.15) 

Jtl 

<  /  1  v(x> T)  1  dr 

(5.16) 

Jti 

<  {h  ~  U)q 

(5.17) 

From  the  existence  theorem  for  ODEs  [63],  the  solution  to  (5.4)  exists  and  is  unique. 
Additionally,  for  a  fixed  t ,  consider  a  complimentary  velocity  field  v(x,  r)  =  — v(x,  t  —  r ) 
with  solution  u(x,  t)  to  the  ODE 

^hid  =  v(x,r)  (5.18) 

dr 

and  boundary  conditions  v(x)  =  0,  Vx  €  dX  and  h(y,r)  =  y  —  u(y,r)  =  y,  Vy  ^  X. 
This  solution  also  exists  and  is  unique.  Then,  for  all  x  G  X  we  have  h(h(x), i)  =  x  and 
h(h(x),r)  =  x.  This  implies  that  h(x,  t)  is  one-to-one  with  inverse  h-1  =  h. 

Finally,  suppose  that  for  some  t  and  x  G  X,  h(x,  t)  =  y  ^  X,  then  by  definition 
of  h,  h(y,  t)  =  y,  but  this  cannot  be  since  it  has  already  been  shown  that  h(y,  t)  — 
h(h(x,  t),t)  =  x.  Therefore,  h  must  also  be  onto. 

The  algorithm  for  a  discrete  template  deformation  using  the  modified  fluid  model  is 
the  same  as  that  of  Section  5.3.  In  step  4.,  solution  to  the  PDE  of  (5.9)  is  done  using 
(5.13). 
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5.5  Fluid  Model  as  a  Prior  Model 


As  mentioned  at  the  introduction  of  this  chapter,  the  fluid  model  attempts  to  maintain 
deformations  which  are  meaningful  or  likely.  This  corresponds  to  the  prior  information 
about  a  which  was  neglected  in  (4.60).  The  model  £  =  Hi  V2  —  /x2  is  directly  related  to  a 
prior  probability  through  the  use  of  the  cost  functional  (5.8)  and  the  Gibbs  distribution 
(see  Section  3.1.2).  The  functional  corresponding  to  the  differential  operator  of  (5.9)  can 
be  related  to  a  prior  probability  through  the  Gibbs  distribution.  Since  a  =  a(h(x))  = 
a(x  —  u(x))  it  follows  that 


^951 

II 

.cr 

II 

=^exp{  “  \jx  MVvf  +  Hvl2dx} 

(5.19) 

1  (  1  f  r  |_  du  2  du  2  \ 

-zexp\-5./*H  *|  ) 

(5.20) 

Analysis  of  (5.20)  shows  that  for  a  discrete  u, 

P(v,i  |  Mj,  V)  #  i)  =  P(ui  |  Uj,Vj  adjacent  to  i)  (5.21) 

Equation  (5.21)  is  the  definition  of  a  Markov  random  field.  Thus  the  fluid  model  is 
capable  of  capturing  only  the  first-order  dependencies  between  pixels.  In  fact,  it  is  the 
case  that  almost  all  PDE-based  deformable  methods  capture  only  the  first-  or  second- 
order  dependencies  between  pixels. 

5.6  Implementation  Issues 

There  are  three  constants  associated  with  the  template  deformation:  /q,  //2,  and  A. 
From  (5.11)  and  (5.7)  we  note  that  there  is  no  loss  of  generality  in  choosing  fi i  =  1,  since 
any  ratio  of  ^2/^1  can  be  obtained  by  proper  selection  of  /x2  and  the  scale  of  v  can  be 
manipulated  using  the  step  size  A.  There  is  a  trade-off  in  selecting  A  since  for  smaller 
step  sizes,  the  approximation  in  (5.7)  is  more  accurate  but  takes  longer  to  deform.  We 
use  A  =  l/(maxx(|v(x)|).  With  a  fixed  // 1,  /i2  will  stipulate  the  relative  importance 
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between  limiting  the  gradient  of  v  and  v  itself.  Lower  values  of  /x2  result  in  a  smoother 
deformation,  while  higher  values  lead  to  a  more  fluid  deformation.  No  attempt  has  been 
made  to  determine  the  optimum  value  of  fj,2,  but  values  in  the  range  50-1000  are  typically 
selected. 


5.7  Experimental  Validation 

Figure  5.5  demonstrates  that  the  original  PDE  introduces  displacements  orthogonal 
to  the  direction  of  the  body  force.  In  this  figure  a  square  was  deformed  to  a  rectangle. 
Figure  5.5(a)  was  deformed  using  the  original  PDE,  and  Figure  5.5  was  deformed  using 
the  modified  PDE. 


(a)  (b) 


Figure  5.5  A  third  demonstration  of  orthogonally  induced  displacements.  A  square 
partially  deformed  to  a  rectangle  using  (a)  the  original  fluid  PDE,  (b)  the  modified  PDE. 
(a)  shows  a  displacement  in  the  top  and  bottom. 

The  final  results  for  both  the  original  and  modified  fluid  models  are  nearly  identical. 
Figure  5.6  shows  a  comparison  of  both  methods. 

Deformations  resulting  from  the  viscous  fluid  model  are  expected  to  preserve  the 
topology  of  the  template  while  allowing  large,  curved  deformations.  Figure  5.7  depicts 
the  result  of  a  patch  to  “C”  [25,  66]  deformation  driven  by  a  least-squares  similarity 
metric.  In  this  example,  it  is  clear  that  removal  of  the  divergence  regularization  for 
the  velocity  field  from  (5.8)  and  modification  of  the  boundary  conditions  used  to  yield 
the  explicit  expression  for  the  velocity  field  (5.13)  have  not  removed  the  ability  of  the 
resulting  deformations  to  accommodate  large,  curved  displacements. 
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(a) 


(b) 


(c) 


(d) 

Figure  5.6  Comparison  between  the  original  and  modified  fluid  models,  (a)  Template, 
(b)  study,  (c)  deformed  template  using  original  PDE,  (d)  deformed  template  using  mod¬ 
ified  PDE. 


(a)  (b)  (c)  (d)  (e) 

Figure  5.7  Illustration  of  the  fluid  transform’s  ability  to  accommodate  large,  curved 
deformations,  (a)  Template,  (b)  study,  (c)-(e)  deformed  template  after  50,  100,  150  time 
steps. 
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Linear  Size  of  Image  (Pixels) 


Figure  5.8  Average  time  required  to  solve  for  the  instantaneous  velocity  field  in  the 
course  of  the  patch  to  “C”  experiment  at  various  image  sizes. 

Attention  is  now  turned  to  quantitative  measures  of  the  efficiency  of  the  proposed 
technique.  Figure  5.8  reports  the  average  CPU  time,  in  CPU  seconds  as  measured  on  a 
500-MHz  ev56  DEC  Alpha,  for  the  numerical  solution  of  (5.6)  via  SOR  (“original  pde”), 
the  numerical  solution  of  (5.9)  via  SOR  (“modified  pde”),  and  the  spectral  evaluation 
of  the  velocity  field  via  (5.13)  (“spectral  solution”).  In  all  cases,  the  time  reported  is 
averaged  over  the  course  of  a  patch  to  “C”  experiment  (Figure  5.7)  performed  at  various 
image  resolutions.  For  the  PDE  (5.6),  it  was  found  that  the  relaxation  parameter  uj  «  1.9 
yielded  optimal  results.  For  the  PDE  (5.9),  it  was  found  that  the  relaxation  parameter 
oj  1.5  yielded  optimal  results.  The  spectral  solution  method  was  implemented  with 
fast  transforms  provided  by  the  FFTW  [67]  package,  which  automatically  determines  the 
optimal  FFT  algorithm  for  a  given  data  size  and  machine  architecture. 

It  is  clear  from  Figure  5.8  that  the  proposed  spectral  technique  is  indeed  faster  than 
the  SOR  method  at  solving  either  (5.6)  or  (5.9).  Additionally,  it  can  be  seen  that  the 
greatest  advantage  is  achieved  by  eliminating  the  penalty  on  the  divergence  of  the  velocity 
field.  This  accounts  for  the  difference  between  the  times  indicated  as  “original  pde”  and 
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a 


Linear  Size  of  Image  (Pixels) 


Figure  5.9  Peak  time  required  to  solve  for  the  instantaneous  velocity  field  in  the  course 
of  the  patch  to  “C”  experiment  at  various  image  sizes. 

“modified  pde.”  One  surprising  feature  is  that  the  spectral  technique  appears  to  scale 
no  more  favorably  than  the  SOR  solution  of  (5.9).  This  is  explained  by  the  fact  that 
in  the  SOR  procedure,  the  initial  guess  has  a  strong  influence  on  the  resulting  number 
of  iterations.  In  the  case  of  modeling  a  time-variant  velocity  field,  the  field  found  at 
the  previous  discrete  time  step  served  as  an  excellent  initial  guess  for  the  velocity  at  the 
current  time  step.  This  may  not  be  the  case  in  more  complicated  images. 

Figure  5.9  reveals  the  expected  superiority  in  the  scaling  of  the  spectral  technique 
over  the  SOR  by  examining  the  maximum  CPU  times  encountered  in  the  course  of  the 
patch  to  “C”  experiment.  In  these  “worst-case”  scenarios,  the  SOR  loses  the  benefit  of 
having  a  very  good  initial  guess.  All  times  are  again  in  CPU  seconds  as  measured  on  a 
500-MHz  ev56  DEC  Alpha. 

The  performance  of  the  modified  technique  has  been  examined  on  a  3-D  example. 
Figure  5.10  depicts  the  result  of  a  3-D  patch  to  “C”  experiment.  The  experiment  took 
place  on  a  128  x  128  x  32  grid  and  was  completed  in  under  one  hour  on  a  500-MHz  DEC 
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(e)  (f) 

Figure  5.10  Three-dimensional  patch  to  “C”  experiment,  (a)  Template,  (b)  study, 
(c)-(f)  template  at  various  stages  of  the  deformation. 

Alpha  ev5  CPU.  Because  the  template  is  somewhat  “thinner”  than  the  target,  truly  3-D 
deformations  are  induced. 

Performance  of  the  modified  fluid  model  with  mutual  information  has  also  been  exam¬ 
ined  on  medical  images.  Figure  5.11  shows  a  magnetic  resonance  (MR)  proton  density 
(PD)  image  from  the  Visible  Human  Project,  a  template  and  the  resulting  deformed 
template.  The  deformed  template  structure  closely  resembles  the  study  image  structure. 
Notice  the  ability  of  the  deformation  to  track  large,  curved  deformations. 

Figure  5.12  shows  a  second  demonstration  of  the  deformation  process  for  images  with 
significantly  different  contrasts  than  Figure  5.11(a).  In  this  case  the  top  row  shows 
deformation  to  an  MR  T-l  weighted  image,  and  the  second  row  shows  deformation  to  a 
cryosection  image.  The  same  template  and  modified  fluid  deformation  algorithm  is  used, 
and  no  preprocessing  is  necessary.  Although  Figure  5.12(a)  is  noticeably  noisier  than  the 
other  images,  the  algorithm  still  performs  reasonably  well. 
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(a)  (b)  (c) 

Figure  5.11  Template  deformation  to  an  MR-PD  medical  image,  (a)  Study,  (b)  tem¬ 
plate,  (c)  deformed  template. 


(a) 


(b) 


(c) 


(d)  (e)  (f) 

Figure  5.12  Deformation  to  MR-T1  and  cryosection  medical  images,  (a)  Study,  (b) 
template,  (c)  deformed  template,  (d)  Study,  (e)  template,  (f)  deformed  template. 
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Note  that  a  large  portion  of  the  white  matter  (corresponding  to  the  brightest  portion 
of  the  template)  at  the  lower  left  portion  of  the  brain  in  Figure  5.12(a)  is  misclassified. 
This  is  partially  due  to  initial  alignment  of  the  template,  and  also  due  to  a  topology  dif¬ 
ference  between  the  template  and  study.  For  the  deformation  of  the  template  to  5.12(d), 
the  template  intensity  corresponding  to  skull  (third  darkest)  over-deforms  outward  at  the 
top  of  the  image.  This  is  a  result  of  a  slight  gradient  in  the  image  intensity  values  from 
top  to  bottom.  This  is  most  likely  due  to  lighting  conditions  at  image  acquisition. 

Discussions  earlier  in  this  chapter  and  in  Chapter  4  mentioned  the  theoretical  impact 
of  changes  in  the  Parzen  window  width  used  in  the  body  force  calculation  and  changes 
in  H2  from  (5.9).  Figures  5.13  though  5.15  show  results  for  various  settings  of  these 
parameters. 

Section  4.1.3  stated  that  Parzen  windowing  should  reduce  partial  volume  and  noise 
effects.  In  Figure  5.13  the  first  column  shows  the  study  images,  the  second  column  shows 
deformation  using  only  normalized  histograms,  and  deformations  in  the  third  and  forth 
columns  use  Gaussian  Parzen  windows  with  standard  deviation  5  and  10,  respectively. 
The  results  show  that  noise  effects  and  partial  volume  effects  are  reduced.  Figure  5.13(e) 
is  quite  noisy.  The  boundary  between  skull  and  scalp  is  quite  jagged  in  (f)  due  to  the 
noise,  but  these  effects  are  reduced  in  (g)  and  (h).  The  finger  like  regions  of  the  white 
matter  typically  suffer  more  from  partial  volume  effects  than  other  regions  of  the  images. 
Comparison  between  columns  of  the  fingerlike  white  matter  regions  shows  that  Parzen 
windowing  allows  the  deformation  to  more  fully  extend  the  fingers  in  some  cases. 

Figures  5.14  and  5.15  show  image  deformations  as  /i2  varies  between  50, 100,  and  1000 
for  columns  2,  3,  and  4,  respectively.  Larger  values  of  ^2  lead  to  more  fluid  transforms. 
As  the  model  becomes  more  fluid,  the  deformation  is  more  capable  of  capturing  the  large 
nonlinear  variations  in  the  white/gray  matter  boundaries.  However,  other  parts  of  the 
template  are  more  likely  to  over-deform. 
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(a)  (b)  (c)  (d) 


(i)  (j)  00  (1) 

Figure  5.13  Deformation  to  several  images  using  Parzen  windows  of  differing  widths 
when  calculating  the  body  force.  From  left  to  right,  columns  show  the  study,  histogram- 
based  probability  estimation,  Parzen  estimation  with  a  =  5,  and  Parzen  estimation  with 
a  =  10. 
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(a)  (b)  (c)  (d) 


(e)  (f)  (g)  00 


(i)  (j)  00  0) 

Figure  5.14  Deformation  to  several  images  using  various  values  for  fi2  in  (5.9).  From 
left  to  right,  columns  show  the  study,  and  jj,2  =  50,  100,  1000. 


87 


Figure  5.15  Deformation  to  several  other  images  using  various  values  of  /z2.  From  left 
to  right,  columns  show  the  study,  and  /j2  =  50,  100,  1000. 
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5.8  Summary 


A  fluid  deformable  model  has  been  presented.  The  model  maintains  the  topology 
of  the  template  while  allowing  large,  nonlinear  deformations.  The  model  is  theoretically 
and  experimentally  faster  than  the  fluid  model  proposed  elsewhere.  Parameters  requiring 
manual  selection  were  a  fluid  parameter  yu2  and  the  Parzen  window  width  a.  A  value 
of  a  =  5  was  found  to  perform  well.  When  /i2  was  selected  to  make  the  deformations 
sufficiently  fluid  to  capture  the  variability  in  the  white/gray  matter,  the  model  tended  to 
over-deform  in  other  areas.  The  PDE  imposes  prior  constraints  about  smoothness  and 
topology  on  the  template.  The  constraints  were  shown  to  be  Markovian  in  nature.  Use 
of  more  specific  shape  information  is  studied  in  Chapter  6. 
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CHAPTER  6 


INCLUSION  OF  SHAPE  INFORMATION 


In  allowing  large  curved  deformations  of  the  template,  the  fluid  model  also  allows  over¬ 
deformation  of  regions.  A  possible  solution  is  to  determine  an  allowable  displacement  for 
each  element  of  the  template;  however,  the  displacements  are  not  independent  of  each 
other.  In  fact,  most  interesting  images,  including  brain  images,  contain  dependencies 
between  distant  pixels.  For  example,  the  position  of  an  element  on  one  side  of  the 
skull  may  be  coupled  with  a  symmetric  point  on  the  opposite  side  of  the  image.  These 
dependencies  are  critical  to  the  generation  of  templates  consistent  with  the  objects  they 
represent,  but  they  are  not  well  modeled  using  PDEs  as  developed  in  Chapter  5. 

From  an  information-theoretic  viewpoint,  promoting  likely  variations  corresponds  to 
minimizing  the  information  of  the  deformed  template  /( a(x))  (see  Section  4.5.1).  Section 
5.5  showed  that  the  fluid  model  is  capable  of  modeling  only  the  Markov  dependencies  in 
the  template.  This  chapter  discusses  modeling  dependencies  which  are  higher  order  than 
Markov. 

The  shape  modeling  presented  in  this  chapter  is  based  on  a  point  distribution  model 
developed  by  Cootes  et  al.  [40].  The  point  distribution  model  is  summarized,  followed 
by  the  presentation  of  a  model  for  capturing  global  dependencies  of  a  deformed  template 
and  a  method  for  including  the  shape  information  in  the  deformation  process. 
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6.1  The  Point  Distribution  Model 


Cootes  et  al.  [39,  40]  developed  a  model  to  capture  meaningful,  highly  correlated  vari¬ 
ations  among  training  samples  and  represent  them  in  a  compact,  intuitive  form:  the  point 
distribution  model  (PDM).  The  approach  is  to  represent  these  dependencies  in  terms  of 
an  orthonormal  basis  set.  Deformations  are  then  restricted  by  simple  independent  limits 
on  model  parameters.  The  following  is  the  PDM  as  described  in  [39,  40,  42], 

A  shape  is  defined  to  be  an  equivalence  class  for  which  instances  of  the  class  show  a 
high  degree  of  similarity  through  the  operation  of  a  similarity  group  (such  as  rotation, 
shift  and  scale).  For  purposes  of  the  PDM,  all  shape  instances  must  have  have  the  same 
number  of  points  A'*,  and  must  be  in  the  same  order.  For  convenience,  the  point  order 
usually  follows  sequentially  around  the  shape,  but  this  is  not  necessary.  At  times,  a  shape 
instance  is  also  referred  to  as  a  shape. 

Mathematically  speaking,  let  q  =  represent  a  shape  instance.  The  similarity 

between  q  and  an  exemplar  shape  q*  is  measured  by  [42] 

Nk 

e  =  -  9**,i t)2  + 

k= 1 

(Tfe,*)  -  (6.1) 

where  T  is  the  set  of  affine  transforms  compensating  for  all  possible  translations,  scales, 
and  rotations  of  the  shape  denoted  by  T. 

Given  a  set  of  Ni  shape  instances,  define  T*  to  be  the  T  which  minimizes  e  for  the  ith 
shape  instance.  Each  normalized  shape  instance  7J*(q i)  can  be  represented  as  a  2Nk~D 
vector.  When  all  normalized  training  examples  are  plotted  in  the  2Afc-D  space,  they  will 
form  a  cluster  of  points.  This  cluster  lies  in  a  region  of  space  called  the  allowable  shape 
domain  (ASD).  The  PDM  models  the  coordinate  variations  within  the  clusters.  Assuming 
linear  dependence  on  the  coordinates,  the  ASD  will  be  approximately  ellipsoidal  and  the 
size  will  be  approximately  described  by  the  cluster  of  training  examples.  Every  vector 
within  the  ASD  will  correspond  to  a  set  of  2-D  points  whose  shape  is  broadly  similar  to 
those  in  the  training  set. 


91 


(6.2) 


The  center  of  the  ASD,  or  the  average  shape,  is  calculated  using 

1  %—\ 

The  ASD  can  be  described  using  a  basis  set  corresponding  to  its  principal  axes.  Each 
axis  gives  a  mode  of  variation,  a  way  in  which  the  2-D  shape  points  tend  to  move  together 
as  the  overall  shape  varies.  The  principal  axes  are  found  using  principal  component 
analysis  (PCA)  [68].  Eigenvalues  and  eigenvectors  are  found  for  a  covariance  matrix 

(6-3> 

yVi  <= i 

The  modes  of  variations  within  the  training  set  are  described  by  the  unit  eigenvectors  p*, 
and  the  variance  explained  by  each  eigenvector  is  equal  to  the  corresponding  eigenvalue 
A k-  Any  point  in  the  ASD  ellipsoid  can  be  reached  by  adding  a  linear  combination  of  the 
eigenvectors  to  the  mean  shape.  It  turns  out  that  almost  all  of  the  shape  variations  can 
be  explained  using  the  no  principal  modes  of  variation;  thus,  any  shape  instance  can  be 
approximated  using  the  mean  shape  and  a  linear  combination  of  the  first  n0  eigenvectors. 
Define 

S=  (pi,P2,-..  ,Pn0)  (6-4) 

Then  shape  instance  q  is  approximated  by 

q»(r)-1ft  +  ?9)  (6.5) 

where  f3  is  a  vector  of  weights.  Due  to  the  orthonormality  of  the  eigenvectors,  (3  can  be 
found  by 

0,  =  E^Cnq)  -  q)  (6.6) 

Suitable  limits  for  (3  are  typically  on  the  order  of 

— 3\/A ~k  <  Pk<  Sy/\ ~k  (6.7) 

since  most  of  a  population  lies  within  three  standard  deviations  of  the  mean.  The  PDM, 
then,  defines  a  shape  using  the  mean  shape  instance  and  parameters  E  and  A. 
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6.2  The  Shape  Model 


A  single  PDM  is  ill-suited  to  directly  model  all  allowable  variations  in  a  template.  This 
is  due  to  two  factors.  First,  large,  nonlinear  deformations  of  the  template  are  allowed, 
but  they  are  not  well  modeled  using  the  PDM.  Second,  the  dimensionality  of  the  ASD 
for  an  entire  template  is  cumbersome.  Instead,  PDMs  are  created  for  important  subsets 
of  the  template.  Dividing  a  complex  shape  into  several,  more  well-behaved,  subshapes 
is  also  problematic  if  correlation  is  ignored  between  points  in  different  subshapes.  For 
contour-based  deformations,  topology  between  subshapes  is  not  necessarily  conserved. 
A  conflict  could  occur,  for  example,  if  image  points  belong  to  more  than  one  subshape 
[42].  However,  because  the  deformation  of  this  thesis  is  volumetric  rather  than  contour- 
based,  multiple  subshapes  can  be  simultaneously  deformed.  The  fluid  properties  of  the 
deformation  account  for  the  topological  constraints  between  subshapes. 

Let  Xj  C  X  be  the  jth  of  Nj  subsets  of  pixel  locations  in  the  template.  Each  subset 
consists  of  Njtk  points  in  X.  Through  the  deformation  process  these  points  are  relocated 
to  q  =  h-1(Xj)  C  X.  The  overall  shape  which  will  be  modeled  is  q  =  (JVy  Qi *  The 
subsets  are  selected  such  that  expected  variations  are  linear  in  nature,  and  each  q  has 
a  high  degree  of  regularity  or  shape  as  determined  by  e  in  (6.1).  However,  variations  in 
q  may  not  be  linear  in  nature. 

Let  T*  be  the  Tj  which  minimizes  (6.1)  for  subshape  q.  For  a  general  2-D  transform 
allowing  all  possible  shifts,  rotations,  and  scales,  each  7~*  can  be  described  by  four  scalars 
and  put  in  vector  form  tj.  The  transform  vectors  for  each  subset  can  be  concatenated  into 
a  single  vector  t*  =  t(  (tl)T,  (t*2)T, ...  ,  (t*N.)T  )T  of  length  4Nj.  A  rough  approximation 
to  the  shape  is  obtained  using  only  the  mean  subshapes  and  the  rigid  transforms. 

q  «  LK7?)"1^)  (6-8) 

Vi 

The  information  content  of  the  deformed  template  is  roughly  approximated  by 

7(a(x))  «  J(q)  =  J(f)  (6.9) 
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In  addition  to  the  affine  variations  in  the  subshapes,  there  are  nonaffine  variations 
which  preclude  e in  (6.1)  from  being  zero.  These  variations  are  mathematically  described 
by  Tj{qj)  —  q*.  Ensuring  that  the  generated  deformations  also  follow  likely  nonaffine 
transformations  leads  to  a  more  accurate  approximation  of  7(a(x))  by 

/( 5(x))  *  1(f)  +  £  /( Tfa)  -  qj)  (6.10) 

j~  1 


6.2.1  Model  statistics  from  training  samples 


Using  the  model  of  (6.10),  the  dimensionality  of  the  probability  space  has  been  re¬ 
duced  from  2 NxNy  to  ANj  +  Y^j=\  2Arj,/c,  where  Nj)k  is  the  number  of  points  in  subshape 
j.  The  shape  model  can  be  made  further  amenable  by  the  use  of  the  PDM.  The  PDM 
is  used  to  model  both  the  intrashape  variations  (nonaffine  variations  within  a  subshape) 
and  the  intershape  variations  (correlations  between  affine  transform  parameters). 

Consider  a  training  set  of  Ni  deformed  templates  {a,},  i  =  1, . . .  ,  Nt  from  which  q^- 
are  obtained.  Subscript  i  corresponds  to  the  training  sample,  j  to  the  subshape,  and 
k  to  a  particular  point  in  the  subshape.  Let  t*  be  the  parameter  vector  for  the  affine 
transforms  which  minimize  (6.1)  for  all  subshapes  of  the  ith  training  example.  Then 
define  the  mean  of  the  t*  as 


i  Ni 


2—1 


For  each  transform,  its  deviation  from  the  mean  is 


(6.11) 


dt  i  =  U  - 1 


and  the  covariance  matrix  is  calculated  for  the  relative  position  using 

1  Ni 

St  =  — (6.12) 

*  i= 1 

Let 


“t  —  (P(,l  ?  Pt,2>  •  •  •  )Pi,nt) 


(6.13) 
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be  the  matrix  containing  the  nt  principal  eigenvectors  of  Xt  is  the  vector  of  corre¬ 
sponding  eigenvalues. 

Intrashape  variations  are  captured  in  a  manner  similar  to  that  described  above  for 
the  intershape  variations.  For  each  of  the  Nj  subshapes,  a  mean  shape  q^,  a  principal 
eigenvector  matrix  Ej,  and  a  principal  eigenvalue  vector  A j  are  found. 

Any  shape  can  be  approximately  described  using  the  mean  transform  vector,  the  mean 
subshapes,  and  a  weighted  sum  of  the  eigenvectors  found  in  Et  and  Ej,  j  =  1, . . .  ,  Nj. 
The  weights  corresponding  to  the  eigenvectors  are  f3t  and  (3j.  When  f3t  and  /3^  are  all 
zero  vectors,  the  resulting  shape  is  the  average  shape. 

Once  the  model  has  been  trained  using  the  training  data,  the  mean  subshapes,  eigen¬ 
values,  and  eigenvectors  are  deterministic.  The  only  uncertainty  in  the  shape  is  found  in 
the  (3t  and  0j.  Thus,  (6.10)  becomes 

Nj 

WzL))Kl(fit)  +  'Z,H0s)  (6'14> 

i= i 

Notation  is  slightly  simplified  by  letting  /3t  and  At  correspond  to  the  j  =  0  subshape. 
Equation  (6.14)  simplifies  to 

Nj 

J(a(x))  «£/(!»,)  <6'15) 

3= 0 

The  most  likely  templates  are  generated  when  (6.14)  is  minimized.  This  is  done  by  min¬ 
imizing  the  /3  parameters.  Since  the  individual  elements  of  f5j  are  assumed  independent, 
they  can  be  varied  independently  of  each  other.  Acceptable  ranges  are  determined  by 
the  A j 


6.3  Inclusion  of  Shape  Information  into  the 
Deformable  Model 

Under  the  proposed  method,  the  learned  shape  information  is  introduced  into  the 
deformable  process  as  an  additional  body  force  bp  as  shown  in  (1.4),  which  is  restated 
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here  for  convenience: 


£u  +  bd  +  bp  =  0  (6.16) 

where  b«*  is  the  body  force  from  the  mutual  information,  C  is  the  fluid  model,  and  u  the 
displacement  field. 

Calculation  of  bp  is  done  in  the  following  manner.  Let  q  describe  the  current  defor¬ 
mation  at  iteration  time  t.  The  associated  /3jtk  give  an  indication  of  the  prior  information 
contained  in  the  deformed  template.  Unless  /?,-,*  =  0,  V?',  k,  a  more  likely  shape  can  be 
found  by  reducing  some  or  all  of  the  f3jtk-  Denote  a  more  likely  shape  by  q  with  associated 
model  parameters  Following  are  some  desirable  properties  of 

1. 

(6.17) 


2. 


\hk\<%V\*,  Vj,k 


Pj,k 

< 

$m,n 

— > 

Pj,k 

< 

$7X1,71 

\/  ^m,n 

Pj,k 

$m,n 

4. 

sgn  (fa, k)  =  sgn(/3j,fc) 

An  equation  for  (3jtk  with  these  properties  is 

Pj,k  =  sgn /3  •  300  ^1  -  exp 

It  follows  that 

Qj  =  70  (ty  +  Sj^i) 


(6.18) 


(6.19) 


(6.20) 


(6.21) 


(6.22) 
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and  the  more  likely  shape  is 


Nj 

q  = 

3= 1 

Finally,  bp  is  given  by 

bP  =  //4(q  -  q) 


(6.23) 


(6.24) 


6.4  Implementation  Issues 


The  shape  method  just  described  introduces  one  constant,  fa  for  manual  selection, 
in  addition  to  those  already  explained  in  Section  5.6.  The  constant  fa  represents  the 
relative  emphasis  between  the  data-dependent  body  force  and  the  shape-dependent  body 
force.  For  all  experiments  shown  in  this  thesis  fa  was  chosen  such  that  the  ratio  between 
the  maximum  force  in  bp  and  the  maximum  force  in  b d  was  proportional  to  the  maxi¬ 
mum  Mahalanobis  distance  of  the  subshapes  (and  transform  vector)  from  their  means. 
Specifically, 


fa 


maxx(|v(x)|)  (ypf  flA  \ 

maxfeij(|q  — q|)  1  {£  \hJ  J 


(6.25) 


The  other  parameters  were  chosen  as  explained  in  Section  5.6. 

In  order  to  implement  the  proposed  model,  the  template  subsets  Xj  to  track  as  shapes 
must  be  determined.  As  mentioned  previously,  these  subsets  should  have  a  high  degree 
of  regularity  as  measured  by  (6.1),  and  the  variations  should  be  linear  in  nature.  In 
the  current  implementation,  these  points  are  manually  selected.  Future  research  could 
implement  a  more  automated  method  for  shape  selection. 

A  second  issue  is  that  each  training  example  q2  must  have  the  same  number  of  points 
k  for  all  i,  the  points  must  be  in  the  same  order,  and  they  must  have  the  same  start 
point.  Manually  segmenting  a  training  image  is  already  a  time  consuming  task,  and 
requiring  the  human  expert  to  also  count  and  order  points  along  the  segmentation  surfaces 
magnifies  the  difficulty  of  the  task.  Duta  et  al.  [69]  present  a  method  to  solve  this 
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problem  which  is  fully  automated  and  quite  robust.  The  approach  is  based  on  a  quasi- 
exhaustive  exploration  of  point  correspondences.  An  alternate  method  is  chosen  based  on 
the  deformation  of  the  template  a(x)  to  the  segmented  structure  Sj(x)  of  the  ith  training 
image.  Because  the  study  images  are  of  equivalent  topology,  a  transform  h;(x)  can  be 
found  such  that  a,(h,(x))  =  Sj(x).  Furthermore,  hj(x)  is  one-to-one,  so  each  point  in 
the  deformed  template  can  be  mapped  back  to  a  unique  point  in  the  original  template. 
Using  these  facts,  an  order  is  manually  specified  to  the  points  comprising  the  shape  in 
the  original  template.  The  template  is  deformed  to  all  segmentations  in  the  training  set 
(without  using  shape  priors),  using  hj(x)  to  keep  track  of  the  order  of  the  points  as  they 
deform. 

The  same  procedure  is  used  to  keep  track  of  the  shape  while  the  template  is  being 
deformed  to  a  study  image.  In  practice  enough  points  of  the  template  are  given  to 
describe  the  shape. 

The  Mi-based  body  force  contains  nonzero  values  on  both  sides  of  the  region 
boundaries  in  the  deformed  template.  It  is  important  that  either  bp  is  applied  to  both 
sides  of  the  boundaries  or  that  bd  is  consolidated  to  one  side  of  the  boundary.  Otherwise, 
bp  will  not  be  able  to  sufficiently  cancel  elements  of  bd  which  lead  to  over-deformation. 

It  is  assumed  that  the  template  will  remain  aligned  as  to  rotation  throughout  the  de¬ 
formation  process,  so  the  allowable  transform  space  T  for  normalizing  subshapes  consists 
of  three  scalars  specifying  a  scale  and  translation. 

6.5  Experimental  Results 

Experiments  were  conducted  to  test  the  proposed  method  for  segmentation  of  images 
with  widely  differing  contrasts.  The  training  set  was  comprised  of  32  2-D  axial  slices  of 
the  supraventrical  region  of  the  brain.  Images  were  obtained  from  the  VISIBLE  HUMAN 
project  and  through  local  sources.  The  images  were  manually  labeled  according  to  five 
regions:  white  matter,  gray  matter,  bone,  scalp,  and  background.  A  template  was  created 
and  deformed  to  match  each  of  the  training  image  segmentations.  The  subshapes  chosen 
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for  the  shape  model  were  the  outer  contour  of  the  scalp,  the  outer  contour  of  the  skull, 
and  the  inner  contour  of  the  skull.  Both  inter-  and  intrashape  variations  were  estimated 
as  described  in  Section  6.2.1.  Enough  principal  components  were  used  to  capture  95%  of 
the  variance. 

The  template  was  then  deformed  to  study  images  not  found  in  the  training  set  with  the 
intent  of  estimating  the  correct  segmentation  of  study  images.  Figures  6.1  and  6.2  show 
deformation  results  for  images  both  with  and  without  the  inclusion  of  shape  information. 
The  figures  illustrate  the  ability  of  the  shape  information  to  prevent  over-deformation 
in  areas  around  the  brain  while  still  allowing  the  large,  curved  deformations  required  to 
capture  white/gray  matter  boundaries. 


6.6  Summary 

A  shape  model  has  been  presented.  Subsets  of  points  in  the  template  are  chosen  that 
show  a  high  degree  of  regularity,  or  shape,  when  deformed.  Both  affine  variations  between 
subshapes  and  residual  variations  within  each  subshape  are  captured.  Point  distribution 
models  are  used  to  compactly  model  the  dependencies  between  point  and  subshapes.  By 
this  method,  the  shape  of  the  template  is  represented  by  an  average  shape  instance  and 
a  set  of  independent  parameters.  Attenuation  of  these  parameters  leads  to  more  likely 
shapes.  By  training  the  PDMs  on  a  set  of  deformed  templates,  the  number  and  ordering 
of  points  in  the  training  set  is  automatically  achieved. 

The  shape  information  is  included  in  the  deformation  process  through  an  additional 
body  force  component.  The  force  drives  the  template  toward  more  likely  shapes  based 
on  the  information  learned  in  the  shape  model.  An  equation  to  calculate  the  body  force 
was  derived  from  an  axiomatic  characterization  of  the  force. 

Experiments  were  conducted  to  assess  the  ability  of  the  shape  information  to  maintain 
acceptable  transformations  of  the  template.  The  inclusion  of  shape  information  was 
qualitatively  shown  to  maintain  acceptable  deformations  of  the  scalp  and  skull  contours, 
while  allowing  large,  nonlinear  deformations  of  the  white  matter  and  gray  matter. 
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(a)  (b)  (c) 


(d)  (e)  (f) 


(g)  00  0) 

Figure  6.1  Demonstration  of  the  effectiveness  of  shape  information  in  the  deformation 
process.  Column  1:  Study  images.  Column  2:  Deformation  without  shape  information. 
Column  3:  Deformation  with  shape  information.  The  shape  information  maintains  better 
control  of  scalp  and  skull  contours. 
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(a)  (b)  (c) 


(d)  (e)  (f) 


(g)  00  (0 

Figure  6.2  Demonstration  of  the  effectiveness  of  shape  information  in  the  deformation 
process  (continued).  Column  1:  Study  images.  Column  2:  Deformation  without  shape 
information.  Column  3:  Deformation  with  shape  information. 
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CHAPTER  7 


A  HYBRID  BRAIN  IMAGE  SEGMENTATION 

METHOD 


The  deformable  template  method  described  in  Chapters  4,  5,  and  6  has  been  used  as 
part  of  an  automatic  brain  image  segmentation  framework  being  developed  at  the  Uni¬ 
versity  of  Illinois  Beckman  Institute  [6] .  This  chapter  briefly  introduces  the  framework, 
shows  how  template  deformation  fits  in,  and  shows  experimental  results  for  white/gray 
matter  segmentation  in  MR  images. 

7.1  Brain  Image  Segmentation 

Over  the  last  decade,  many  methods  have  been  proposed  to  tackle  this  problem  [2,  59] 
including  low-level,  knowledge-based,  and  statistical  techniques.  Some  of  the  earliest 
image  segmentation  techniques  involved  primarily  low-level  computer  vision  techniques 
such  as  edge  detection  [59,  70,  71].  Edge  detectors  are  often  not  too  useful  for  brain 
image  segmentation  because  they  identify  local  edges  based  on  image  intensity  gradient 
information  alone,  and  as  a  result,  detected  edges  are  usually  not  well  connected  to  form 
anatomically  meaningful  regions.  Region-based  segmentation  methods  [72,  73,  74,  75, 
76,  77]  overcome  some  of  these  problems,  but  still  do  not  usually  contain  sufficient  prior 
knowledge. 

Knowledge-based  segmentation  techniques,  on  the  other  hand,  rely  primarily  on  prior 
knowledge  of  the  anatomical  structures.  Examples  include  model-based  methods  [78,  79] 
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Study  Image,  d(x) 


Template,  a(x) 


Output 


Figure  7.1  Flowchart  of  the  described  brain  image  segmentation  method. 


and  rule-based  methods  [80] .  Model-based  methods  fit  a  specific  geometric  model  of  the 
brain  to  a  given  data  set  to  determine  the  final  segmentation.  This  approach  works  well 
for  some  normal  brain  images  but  has  trouble  dealing  with  brain  abnormalities.  Rule- 
based  methods  use  “if-then”  type  heuristic  rules  for  pixel  clustering  and  their  practical 
utility  is  often  limited  due  to  inflexibility. 

Other  methods  attempt  to  incorporate  a  priori  information  in  a  flexible  manner  using 
a  statistical  model  [81,  82,  83]  or  a  neural  network  model  [84].  However,  many  computa¬ 
tional  problems  remain,  which  limit  their  practical  application. 

The  segmentation  method  described  here  takes  advantage  of  several  existing  meth¬ 
ods;  the  framework  employs  multiscale  region  segmentation,  image  normalization,  tem¬ 
plate  deformation,  and  statistical  classification  for  the  segmentation  task.  The  resulting 
method  can  be  viewed  as  performing  region  clustering  based  on  geometric  considerations. 
Figure  7.1  schematically  shows  the  overall  process  of  the  proposed  method.  The  template 
deformation  component  is  that  described  in  Chapters  4,  5,  and  6.  The  other  components 
are  described  below. 


7.1.1  Feature  extraction:  Multiscale  region  segmentation 

Due  to  the  nature  of  MR  imaging,  different  brain  tissue  types  tend  to  have  different 
intensities  in  the  images.  Intensities  cannot  be  used  exclusively  in  segmentation  because 
of  field  inhomogeneities,  partial  volume  effects,  and  noise.  However,  intensity  information 
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tends  to  be  a  very  useful  feature  in  the  segmentation  process.  The  purpose  of  multiscale 
region  segmentation  is  to  organize  the  intensity  information  by  grouping  image  pixels 
according  to  a  similarity  measure  based  both  on  their  spatial  relationship  and  intensity 
variations  at  different  scales. 

More  formally,  a  homogeneous  region  R(j),  j  =  0, 1, 2, . . .  with  homogeneity  scale  a 
is  a  set  of  connected  pixels  with  the  following  properties: 

1.  ||d(x1),d(x2)||  <  a,  Vxi,x2  G  R{j)-, 

2.  ||d(x!),  d(x2)|  >  <7,  Vxi  G  R(i),  Vx2  G  R(j),  i  ±  j 

where  ||-,-||  is  some  homogeneity  measure  which  accounts  for  both  intensity  differences 
and  spatial  distances  between  points.  One  possible  measure  is  the  absolute  intensity 
difference  between  connected  pixels,  given  by 

||d(xi),d(x2)||  =|  d(xi)  -  d(x2)  | 

Practically,  however,  applying  such  a  definition  directly  will  not  usually  give  meaningful 
regions  due  to  noise  and  other  sources  of  image  intensity  variations  [74,  75].  Most  existing 
methods  attempt  to  reduce  the  effect  of  the  noise  by  image  smoothing  techniques.  An 
alternate  approach  was  proposed  by  Ahuja  [76]  and  Tabb  and  Ahuja  [77].  This  method 
clusters  pixels  using  attraction  force  fields,  based  on  both  spatial  and  intensity  relation¬ 
ships  with  other  pixels  in  the  image,  and  explicitly  provides  for  multiple  scale  clustering, 
based  on  the  many  levels  of  sensitivity.  A  main  advantage  of  this  algorithm  is  that  it 
does  not  involve  smoothing  of  the  image  to  generate  different  scales,  so  boundary  detail 
is  maintained,  even  in  the  coarse  scales.  Moreover,  gradients  are  not  used,  so  noise  is  not 
amplified.  The  resulting  segmentations  are  organized  into  a  parent-child  tree  structure 
where  each  region  in  a  coarser  level  consists  of  several  subregions  in  the  finer  level.  The 
tree  structure  allows  for  faster  analysis  due  to  the  linking  of  information  between  scale 
levels. 


104 


To  perform  the  multiscale  analysis  on  the  study  image  d  a  vector  field  F(x,  ag,  as)  is 
created  for  each  desired  scale  by 


F(x,  og)  as)  =  dg(&G,  <?g)ds{±  -  x,  as) 

xear 

Here  ag  and  as  are  the  scale  parameters  for  graylevel  and  spatial  similarities,  x  and  x  are 
element  locations  in  d(x),  AG  is  the  difference  in  intensities  between  points  x  and  x,  dg 
is  the  intensity  distance  measure,  and  ds  is  the  spatial  distance  measure.  The  resulting 
vector  field  contains  a  vector  for  each  image  location  that  points  toward  other  image 
elements  to  which  it  is  ‘close’  in  both  intensity  and  position. 

The  intensity  distance  measure  dg  is  computed  using  a  boxcar  function  such  that 


dg{AG,ag ) 


1  AG  <  Og 
0  otherwise. 


(7.1) 


The  spatial  distance  measure  ds  is  also  a  boxcar  function  similar  to  (7.1). 

Tabb  and  Ahuja  [77]  state  that  having  two  scale  parameters  (ag  and  as)  is  overly 
general.  A  meaningful  value  for  as  can  be  adaptively  calculated  based  on  the  value  of 
ag  and  the  image  content.  The  goal  is  to  select  a  small  enough  region  of  interest  so  that 
important  image  information  is  not  lost  while  still  providing  meaningful  region  informa¬ 
tion.  Exact  details  are  provided  in  [77].  Using  this  adaptive  procedure,  F(x,ag,crs)  is 
reduced  to  F(x,  a)  where  a  is  the  original  ag. 

Region  information  is  extracted  from  F(x,  a)  by  realizing  that  the  vectors  within  the 
field  will  tend  to  be  perpendicular  to  region  boundaries  and  converge  toward  the  centers 
of  the  regions.  Region  boundaries  are  located  between  diverging  vectors  in  the  vector 
field.  The  regions  for  the  different  scales  are  organized  in  a  tree  structure  so  that  regions 
at  a  coarse  scale  are  composed  of  one  or  more  regions  in  the  next  finer  scale.  Figure  7.2 
shows  the  region  segmentation  for  three  different  scales. 

The  region  segmentation  is  also  useful  in  automatically  separating  the  head  image 
from  the  background  information.  An  object  silhouette  is  determined  by  selecting  a 
coarse  scale  and  performing  morphological  fill  operations  followed  by  a  closing  operation. 
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Figure  7.2  Multiscale  analysis  performed  on  an  image,  (a)  Image,  (b)  a  —  20,  (c) 
a  =  30,  (d)  a  =  40. 

The  region  information  and  fill  operations  isolate  the  object  from  the  background,  and 
the  closing  operation  eliminates  any  small  holes  in  the  silhouette.  The  silhouette  is  useful 
for  image  normalization,  as  will  be  discussed  in  the  next  section. 

7.1.2  Geometric  constraints 

Although  the  multiscale  region  analysis  provides  useful  features  for  brain  image  seg¬ 
mentation,  these  features  alone  are  inadequate  for  forming  a  meaningful  anatomical  brain 
image  segmentation.  For  example,  without  prior  knowledge  of  the  object  being  imaged,  it 
is  unclear  which  scale  levels  contain  the  proper  region  information  and  how  these  regions 
should  be  combined  or  split  to  create  a  meaningful  anatomical  structure.  Often  used 
techniques  use  region  growing,  split-and-merge,  and  linking  methods  [59,  75,  80,  85]  to 
form  a  segmentation  from  regions.  These  methods  typically  use  pixel  intensities,  gradi¬ 
ents,  and  region  similarity  as  the  criteria  for  combining  regions,  so  the  final  segmentation 
may  be  geometrically  incompatible  with  the  prior  knowledge  about  the  system. 

In  contrast,  the  method  described  herein  relies  on  geometric  information  to  constrain 
the  segmentation  process.  In  our  case  the  geometric  prior  information  is  in  the  form  of  a 
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template  and  a  deformable  model.  The  first  step  in  applying  the  template  to  the  image 
is  image  normalization. 

Image  normalization 

Normalization  is  achieved  by  a  rigid-body  (affine)  registration  of  the  template  and  study 
image.  The  practice  of  using  an  initial  low-dimensional  transformation  (such  as  affine) 
followed  by  a  more  localized  transformation  is  widely  used  [24,  26,  35,  86]  due  mainly  to 
the  fact  that  it  overcomes  local  minima  and  leads  to  faster  convergence. 

Many  rigid-body  registration  techniques,  including  those  compared  in  [87],  are  itera¬ 
tive  techniques  that  adjust  for  only  rotation  and  shift  differences  between  the  two  images. 
In  [24],  however,  Bajcsy  and  Kovacic  use  a  noniterative  method  which  compensates  for 
shift,  rotation,  and  scale.  This  method  is  usually  not  as  accurate  as  the  iterative  meth¬ 
ods,  but  since  normalization  will  be  followed  by  non-rigid-body  deformation,  speed  and 
simplicity  are  the  main  concern  here. 

The  algorithm  calculates  a  normalizing  affine  operator  Afd  which  transforms  d  to 
a  normalized  size  and  orientation.  The  calculation  consists  of  several  steps.  First,  a 
binary  silhouette  image  d&(x)  is  created  from  the  multiscale  region  information  (see 
Sections  7.1.1  and  7.3).  Second,  the  mass  center  of  the  silhouette  is  moved  to  the  image 
origin  to  remove  the  translational  offset.  Next,  major  and  minor  axes  are  found  using 
distances  of  silhouette  pixels  from  the  origin.  The  axes  are  calculated  using  an  eigenvector 
decomposition  of  a  covariance  matrix  for  the  silhouette  locations.  Finally,  a  rotation  is 
calculated  that  will  align  the  image  to  a  normalized  orientation.  These  steps  are  now 
formalized  in  detail. 

1.  Move  the  mass  center  of  the  silhouette  c  =  (cx,  cy)  to  the  origin,  where 

cx  =  ]T^d6(x) 

* 

cy  =  ^2vM^) 

X 
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2.  Calculate  the  eigenvalues,  namely  Ax,  A2,  and  eigenvectors  (with  unit  length),  namely 
v\,v2  of  a  covariance  matrix  M  where 


w(2,0)  «(1,1) 
«(1.1)  «(0,2) 


and  u(k,r)  is  the  central  moment  defined  as 


u(k,r)  =  ~  c*)k(v  -  ^^(x)  (7.2) 

x 


3.  Rotate  the  image  according  to  the  eigenvectors  of  M  so  that  covariance  matrix  is 
diagonalized: 


x'  =  j4(x  —  c) 


where 

A  —  [ux  v2]T  or  A  =  [ux  -v2]t 
such  that  the  diagonal  elements  have  the  same  sign. 

With  the  above  transform,  d(x')  becomes  invariant  to  translation,  shift,  and  scaling 
and  is  in  rough  alignment  with  the  template;  template  deformation  as  described  in  this 
thesis  can  be  performed.  The  deformation  process  will  bring  the  template  into  better 
geometrical  alignment  with  the  structure  of  the  study  image. 


7.1.3  Final  classification 

The  segmentation  given  by  the  deformed  template  may  not  be  accurate  for  several 
reasons,  including  the  following:  (1)  It  may  converge  to  a  local  minimum;  (2)  finding  the 
global  minimum  may  be  impractical;  (3)  even  if  the  globally  optimal  solution  is  obtained, 
there  is  no  guarantee  that  the  estimate  is  optimal  with  respect  to  natural  measures  such 
as  the  human  eye;  and  (4)  topological  differences  may  preclude  correct  segmentation.  The 
purpose  of  the  final  classification  is  to  combine  information  from  the  multiscale  region 
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analysis  with  that  obtained  through  template  deformation  to  obtain  a  more  meaningful 
result. 

Final  classification  is  achieved  by  defining  a  statistical  model  which  incorporates  in¬ 
formation  from  both  the  deformed  template  and  the  multiscale  pyramid.  Each  pixel  will 
then  be  classified  according  to  this  model.  The  probabilistic  model,  denoted  as  P{i,k), 
is  the  confidence  that  the  ith  image  pixel  belongs  to  class  k.  First,  let  P^(i,  k )  define 
the  confidence  that  the  ith  pixel  belongs  to  k  based  on  the  information  in  scale  j. 


Py)(i,fc)  = 


Area(i?^  (i)  D  Sk ) 
Area  (Pb)(i)) 


where  R^(i)  represents  the  region  in  the  jth  level  which  contains  pixel  i,  and  Sk  =  {i  \ 
at  —  k}.  P(i,  k )  is  the  weighted  sum  of  all  scales  where  the  weights  uij  >  0,  j  =  1, 2, ...,  J, 
convey  the  confidence  placed  in  each  scale  level.  Then 

j 

P(i,k)  =  £  UjP{j\i,k).  (7.4) 

j- 1 

If  ujj  are  restricted  such  that  ]Pj=i  =  b  then 

£P<»(i,k)  =  1  (7.5) 


^TP(m,k)  =  1  (7.6) 

fe= l 

so  both  P(j)(i,k)  and  P(i,  k)  are  valid  probability  mass  functions. 

Once  P(i,k)  has  been  calculated,  the  final  segmentation  s  of  the  study  image  d  is 
written  as 


s(i)  =  arg  max  P(i,  A:)  (7.7) 

k 

Notice  that  the  P^(i,  k )  maintain  the  region  information  of  both  the  multiscale  analysis 
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and  the  deformed  template;  for  example, 


o  if  £«>(*)  n  sk  =  0 

1  if  R&(i)cSk 


(7.8) 


7.2  Implementation  Issues 

The  final  classification  was  found  to  be  sensitive  to  values  of  the  homogeneity  scale 
parameter  a  used  in  the  multiscale  region  analysis.  Using  many  small  values  of  a  (e.g., 
1,  2,  3,  4,  5,  6,  7)  and  few  large  values  of  a  (e.g.,  30,  40)  can  cause  an  improper  bias  by 
relying  too  much  on  the  finer  scales.  Furthermore,  values  of  sigma  which  are  appropriate 
for  a  particular  image  contrast  may  not  be  useful  for  a  different  input  image  with  a 
different  contrast.  In  general,  values  for  a  should  be  selected  so  that  significant  structural 
differences  exist  for  different  values  of  a.  One  such  measure  is  to  monitor  the  number 
of  distinct  values  the  region  grayscale  averages  take  on.  Using  this  measure,  values  are 
chosen  such  that  the  number  of  distinct  region  averages  strictly  decreases  for  larger  values 
of  cr  used.  In  [77],  Tabb  and  Ahuja  discuss  structures  that  are  perceptually  relevant  based 
on  the  stability  of  the  structures  when  a  is  changed.  Similar  analysis  could  be  done  to 
decide  which  a  values  should  be  used  for  our  purposes.  This  is  an  area  for  further 
research. 

Once  useful  values  for  a  have  been  found,  the  confidence  placed  in  scale  level,  as 
denoted  by  Uj  in  (7.6),  is  determined.  Assuming  that  the  deformed  template  result  is 
a  fairly  good  estimate  of  the  segmentation,  loj  is  set  to  be  proportional  to  the  inverse 
of  the  average  number  of  Sk  spanned  by  the  deformation  regions  in  level  j.  Thus,  more 
confidence  is  placed  in  scales  that  span  fewer  regions.  This  has  the  effect  of  weighting 
the  finer  scales  slightly  more  heavily  than  the  coarse  scales.  Results  for  other  values  of 
ojj,  such  as  uniform  weighting,  have  not  been  found  to  dramatically  change  the  classi¬ 
fication  result.  Other  machine  learning  algorithms  are  being  investigated  to  determine 
the  suitability  of  each  multiscale  level.  One  such  method  is  the  use  of  neural  network 
training  of  the  values  of  u)j . 
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7.3  Experimental  Results 


Experiments  were  conducted  using  optical  cryosection  and  magnetic  resonance  (MR) 
brain  images  from  the  VISIBLE  HUMAN  database  and  MR  brain  images  from  [88]  and 
local  sources. 

A  test  was  conducted  to  assess  the  ability  to  extract  the  outer  contour  of  the  head  from 
the  image  using  the  multiscale  region  analysis.  The  outer  contours  of  610  2-D  256  x  256 
cryosection  and  MR  images  were  estimated.  Each  image  was  segmented  using  multiscale 
region  analysis.  The  scale  corresponding  to  a  —  30  was  operated  on,  using  morphological 
fill  operations  and  a  closing  operation  to  separate  the  head  from  the  background  and 
remove  small  gaps  in  the  contour.  One  set  of  images  from  a  local  source,  comprising 
about  20%  of  images,  was  found  to  not  fully  utilize  the  dynamic  range  of  the  grayscale 
space  {0,255},  so  the  set  was  linearly  scaled  to  use  the  entire  range.  The  other  images 
were  not  preprocessed. 

Since  the  contour  is  used  for  image  normalization  purposes,  the  silhouette  extraction 
performance  was  measured  by  assessing  the  deviation  of  the  normalized  image  from 
the  ‘correct’  orientation  as  determined  by  a  human  observer.  Contours  for  which  the 
normalization  procedure  was  able  to  transform  the  image  to  within  3°  of  rotation,  or  3% 
of  the  image  size  in  translation,  shift,  or  scale  of  the  correct  orientation  were  considered 
correctly  normalized.  Of  the  610  images,  the  contour  feature  was  sufficiently  extracted 
to  correctly  normalize  all  but  10  of  the  images.  The  processing  was  modified  to  allow 
a  =  20,  30,  or  40,  followed  by  manual  selection  of  the  best  one.  Using  this  method, 
all  but  two  of  the  images  were  correctly  normalized.  Figure  7.3  shows  an  image  and  an 
incorrect  silhouette  using  o  —  30  but  a  correct  silhouette  using  a  —  20.  In  this  case 
selecting  too  coarse  a  scale  causes  part  of  the  head  to  be  considered  background.  The 
final  two  images  required  manual  intervention  to  be  correctly  normalized. 

Tests  were  conducted  on  the  ability  of  the  segmentation  method  to  extract  white 
matter  and  gray  matter  regions  from  a  brain  image.  The  template  was  deformed  to  each 
of  the  images  using  the  deformable  method  outlined  in  this  thesis  (deformable  result). 
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Figure  7.3  Extracting  the  brain  silhouette.  Left:  An  image,  Center:  the  silhouette 
extracted  using  a  =  20,  and  Right:  a  =  30.  a  =  20  is  the  better  choice  for  this  image. 

The  deformable  result  was  combined  with  a  multiscale  region  segmentation  result  as 
outlined  in  Section  7.1.3  (combined  result).  Figures  7.4  and  7.5  show  results  from 
experiments  conducted.  The  first  column  in  each  figure  shows  the  study  images,  the 
second  column  shows  the  template  deformation  result,  and  the  third  column  shows  the 
final  classification  based  on  the  merging  of  both  the  template  deformation  and  multiscale 
analysis.  In  Figure  7.5,  the  final  column  shows  the  correct  manual  classification  as  done 
by  an  expert  [88].  As  is  evident  from  the  results,  the  combined  result  improves  the 
segmentation  in  some  areas,  but  seems  to  degrade  other  areas.  The  combined  result 
improves  the  deformation  in  some  of  the  fingerlike  regions,  where  the  deformation  is  not 
able  to  entirely  deform.  In  the  images  of  Figure  7.5,  the  combined  result  overextends  the 
white  matter  portions. 

Areas  of  the  manual  segmentations  (column  4  of  7.5)  show  deep  grooves  in  outer 
portion  of  the  gray  matter.  These  areas  correspond  to  cerebral  spinal  fluid  (CSF).  The 
template  contains  five  graylevels  to  capture  scalp,  skull,  gray  matter,  white  matter,  and 
background.  In  reality  there  are  many  more  tissue  types  present,  including  CSF,  fat, 
muscle,  and  blood.  As  explained  in  Section  4.5,  using  too  many  graylevels  in  the  process 
leads  to  deformation  instability.  Thus,  there  is  a  trade-off  between  stability  and  the 
number  of  tissues  tracked. 
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(a)  (b)  (c) 


(d)  (e)  (f) 


(g)  00  0) 

Figure  7.4  Comparison  between  segmentation  based  on  template  deformation  and  the 
combination  of  template  deformation  and  multiscale  region  analysis.  Column  1:  Study 
images.  Column  2:  Template  deformation.  Column  3:  Segmentation  combining  defor¬ 
mation  and  region  analysis. 
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(a)  (b)  (c)  (d) 


(e)  (f)  (g)  (h) 


(i)  (j)  00  0) 


Figure  7.5  Comparison  between  segmentation  based  on  template  deformation  and  the 
combination  of  template  deformation  and  multiscale  region  analysis  (continued) .  Column 
1:  Study  images.  Column  2:  Template  deformation.  Column  3:  Segmentation  combining 
deformation  and  region  analysis.  Column  4:  Manual  segmentation  result. 
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7.4  Summary 


This  chapter  presented  a  segmentation  framework  which  is  under  development  at  the 
University  of  Illinois  Beckman  Institute.  The  relationship  was  shown  between  the  de¬ 
formable  method  discussed  earlier  in  this  thesis  and  the  rest  of  the  segmentation  frame¬ 
work.  Both  a  template  deformation  and  multiscale  region  analysis  are  combined  to 
achieve  a  final  classification. 

Experiments  were  conducted  to  test  the  effectiveness  of  the  template  deformation 
within  the  framework.  Comparison  against  manually  segmented  data  shows  that  the 
deformed  result  is  already  quite  good.  The  combined  result  did  improve  the  result  in  some 
cases,  but  led  to  an  overextension  of  the  white  matter  in  other  cases.  While  the  theory 
and  experimental  results  provide  hope  that  accurate  segmentation  can  be  automatically 
achieved,  further  research  is  still  needed  in  this  area. 
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CHAPTER  8 


CONCLUSION 


This  thesis  presented  a  systematic  study  of  deformable  image  transformations.  A 
method  was  proposed  for  nonrigidly  aligning  a  template  with  a  study  image.  This 
method  is  robust  to  a  wide  variety  of  contrast  variations  and  supports  large,  curved 
deviations  from  the  original  template.  The  deformations  are  chosen  in  accordance  with 
maximization  of  the  mutual  information  between  the  image  and  template,  a  viscous  fluid 
model,  and  learned  shape  information.  Potential  applications  of  the  method  include 
image  segmentation,  functional  brain  mapping,  and  automatic  target  recognition. 

The  thesis  research  contributes  to  the  field  of  deformable  image  transformations  in 
several  ways.  First,  an  in-depth  study  of  mutual  information  was  conducted,  as  it  applies 
to  nonrigid  template  deformation.  A  study  of  the  bias  and  variance  of  the  MI  between 
images  was  done,  and  conditions  for  optimality  of  MI  were  defined.  Limitations  based  on 
these  conditions  were  investigated  and  ways  of  mitigating  the  impact  of  these  limitations 
were  presented.  Templates  with  fewer  graylevels  and  a  distinct  graylevel  for  each  struc¬ 
ture  were  more  suitable  for  Mi-based  deformation.  A  formula  was  derived  to  convert  MI 
to  a  body  force.  The  calculation  was  found  to  scale  more  favorably  with  image  size  than 
previously  reported  calculations. 

A  bound  on  the  maximal  mutual  information  was  derived.  The  bound  can  be  calcu¬ 
lated  by  maximizing  the  entropy  of  the  template  under  the  constraint  that  the  conditional 
entropy  of  the  template  be  zero,  given  the  study  image.  Not  only  may  the  bound  prove 
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useful  in  template  initialization  and  selection,  but  it  also  provides  insight  into  the  use¬ 
fulness  of  MI  in  a  nonrigid  setting.  It  was  demonstrated  that  templates  which  achieve 
the  bound  may  not  give  meaningful  segmentations  of  the  study  image.  Thus,  the  need 
for  constraints  on  the  deformation  were  clearly  shown. 

A  fluid  model  was  developed.  The  model  maintains  desirable  properties  of  an  existing 
fluid  model  while  being  computationally  more  efficient.  The  fluid  partial  differential 
equation  models  the  Markov  dependencies  between  template  pixels.  Experimental  results 
showed  that  the  fluid  model  allows  large,  curved  deformations,  but  also  allows  over¬ 
deformation. 

The  shape  model  presented  in  Chapter  6  was  shown  to  reduce  the  over-deformations 
inconsistent  with  prior  knowledge  about  the  shape  of  the  skull  and  scalp,  while  still 
allowing  large,  nonlinear  deformations  in  other  parts  of  the  template.  The  model  captures 
both  inter-  and  intrashape  variations  among  distant  parts  of  the  template. 

The  proposed  deformable  template  model  was  incorporated  into  a  brain  image  seg¬ 
mentation  method  under  development  at  the  Beckman  Institute  for  Advanced  Science 
and  Technology  on  the  University  of  Illinois  at  Urbana-Champaign.  The  segmentation 
method  was  used  to  segment  white  and  gray  matter  from  brain  images.  Results  were 
compared  to  manually  segmented  images.  The  results  were  mixed.  Although  the  com¬ 
plete  segmentation  method  improved  the  deformable  template  in  some  areas,  other  areas 
were  degraded. 

8.1  Areas  of  Further  Research 

Through  the  course  of  this  research  several  possible  avenues  to  pursue  have  been 
discovered;  they  are  discussed  here. 

8.1.1  Incorporating  other  types  of  learning 

Machine  learning  has  been  applied  to  the  deformable  procedure  in  the  form  a  shape 
model  trained  by  previously  deformed  templates.  Other  types  of  learning  may  also  be 
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useful  to  the  process.  For  example,  as  the  deformation  progresses  through  time,  certain 
deformation  sequences  tend  to  occur.  An  associative  memory  may  be  able  to  retrieve  the 
entire  sequence  from  the  first  few  steps. 

8.1.2  Better  integration  of  deformation  and  region  information 

Within  the  developmental  segmentation  framework,  template  deformation  and  mul¬ 
tiscale  region  segmentation  are  done  autonomously,  then  combined  for  the  final  result. 
Incremental  inclusion  of  the  region  information  into  the  deformation  process  may  both 
improve  the  result  and  speed  up  the  process.  For  example,  deformations  may  be  preferred 
which  do  not  cross  region  boundaries.  Resistance  could  be  applied  proportional  to  the 
number  of  scales  for  which  boundaries  are  violated. 

8.1.3  Hierarchical  deformation 

Research  has  been  previously  reported  for  different  ways  to  formulate  deformation 
procedures  in  a  hierarchical  fashion.  Proposed  methods  include  multiple  resolutions  of 
the  image  [24]  and  multiple  degrees  of  fluidity  or  elasticity  [46,  89].  Discussions  in  Sections 
4.5  and  7.3  suggest  another  hierarchical  formulation  of  the  problem.  A  trade-off  was  found 
between  the  number  of  graylevels  in  the  template  and  the  stability  of  the  deformation 
process.  A  possible  modification  of  the  process  would  deform  to  the  large  distinguishable 
structures  first,  such  as  white  matter,  gray  matter,  bone  and  skin;  followed  by  smaller 
structures  such  as  CSF  or  muscle. 

8.1.4  Other  data  sets 

Although  there  is  no  known  limitation  on  the  theory  that  would  preclude  application 
of  the  deformable  model  to  higher  than  two  dimensions,  investigation  may  uncover  diffi¬ 
culties  associated  with  3-D  or  higher  deformations.  Additionally,  it  would  be  interesting 
to  apply  the  deformable  method  to  other  than  medical  images.  The  procedure  could  find 
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application  in  military  or  other  commercial  applications  where  image  understanding  is 
reliant  on  both  prior  information  and  image  features. 


119 


APPENDIX  A 


IDENTITIES  IN  INFORMATION  THEORY 

The  following  are  some  properties  of  mutual  information  and  entropy: 

H{X)  >  0  (A.l) 

MI(X-,Y)  >  0  (A. 2) 

H(X\Y)  <  H(X)  (A.3) 

MI(X;Y)  =  MI(Y]X)  (A.4) 

MI{X-X)  =  H(X)  (A. 5) 

MI(X]Y)  <  min (H(X),H(Y))  (A.6) 

MI(X]Y)  <  ( H(X)  +  H(Y))/2  (A.7) 

MI(X\Y)  <  ma x(i?(X), /7(F))  (A.8) 

MI{X]Y)  <  H(X,  Y)  (A. 9) 

MI{X-Y )  <  H(X)  +  H(Y)  (A. 10) 

MI{X-Y)  =  H(X)-H{X\Y)  (A.ll) 

MI(X-Y)  =  H(Y)  —  Y(Y\Y)  (A.12) 

If  X  and  Y  are  independent,  some  of  the  above  inequalities  turn  into  equalities: 

MI{X-Y)  =  0  (A. 13) 

H(X  |  Y)  =  H(X)  (A. 14) 
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Additionally,  for  independent  X  and  Y,  and  a  one-to-one  function  F(-) 


H(f(X)  +  Y\X) 

=  H(X) 

(A.15) 

H(X  +  Y) 

>  H(X) 

(A.  16) 

H{X  +  Y) 

>  H(Y) 

(A.17) 
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APPENDIX  B 


DERIVATION  OF  THE  FLUID  MODEL 

FUNCTIONAL 


The  calculus  of  variations  [90]  states  that  F(v)  is  stationary  at  v  if 


d 


^(v  +  o/)U  =  0 


Let 


d 

5F  =  lim  7 —F(v  +  erj) 
c-^o  de 


Applying  (B.l)  to  the  first  term  on  the  right-hand  side  of  (5.8)  gives 
F(v  +  erj)  =  J  |  Vv  +  Ver]  |2  dx. 

=  /(Vv  +  Ve#(W  +  Ve,)dx 
=  /(Vvf(Vv)dx  +  2e  J (Vv)T(Vr7)  +  e2  |  Vr/ 
So  the  variation  of  F  based  on  (B.l)  is 

5F  =  2  J  (Vv)r(V?7)dx 
=  2  J  V  •  77VV  —  ?yV2vdx 


COST 


(B.l) 


(B.2) 

dx 


(B-3) 


The  second  line  follows  from  the  identity  [91] 


Using  the  divergence  theorem  leads  to 


5F  =  2  (p  rjVv  ■  hdx  —  2  rj  ■  V2vdx 

Jdx  J 

=  -2  f„.  V2vdx 


(B.4) 


where  the  surface  integral  is  zero  due  to  boundary  conditions.  Since  we  must  ensure  that 
B.4  goes  to  zero  for  arbitrary  77,  it  must  be  that  Vv(x)  =  0 
Similar  derivation  can  be  done  for  the  second  term  of  (5.8): 


F(v  +  erj)  =  J  |  V  •  v  +  V  •  erj  |2  dx 

=  J (V  •  v  +  V  •  e?7)T(V  •  v  +  V  •  e?7)dx 
=  J (V  •  v)T(V  •  v)dx  +  2e  J (V  •  v)T(V  •  77)  +  e2  |  V  •  rj  |2  dx 


So  the  variation  of  F  based  on  (B.l)  is 


which  using  identity 


becomes 


8F  =  2  /(V-  v)T(V  •  77)* 


(V  •  v)(V  -  77)  =  V  •  (77V  •  v)  —  77  •  VV  •  v 


5F  =  2  J  V  •  (77V  •  v)dx  —  J  77  •  VV  •  vdx 


The  second  line  follows  from  the  identity  [91] 


V  •  aV6  =  Va  •  V6  +  aV2& 


Using  the  divergence  theorem  leads  to 


5F  =  2  (j)  (V  •  v)t7  •  ndx  —  2  /  77  •  VV  •  vdx 

Jox  J 


(B.5) 


(B.6) 


(B.7) 


(B.8) 


Again  the  surface  integral  goes  to  zero  due  to  boundary  conditions,  and  it  must  be  that 
VV  -  v  =  0  for  the  functional  to  be  stationary. 
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